Is there some chalk here? OK, so just to recap briefly on where we stand and what we did yesterday. I hope you have all nicely recovered from yesterday's lecture; I heard it was tough for some, and I'm glad you still decided to come today. The stochastic control problem is the following. We have a dynamical system, which is given to us, written as a stochastic differential equation: the dynamics change x incrementally, in small time steps, over the course of time, and there is noise added to the system. Then we have a cost, which consists of a path term and an end term, and we want to minimize this cost. All these terms are given; we know everything. We just have to find the control u, a function of x and time, such that this expectation value is minimized.

The mental picture to keep in mind is the simplest case: dx(t) = u dt + dW(t), so f is just u. Suppose the cost is the expectation value of an end cost φ(x(T)), which may express that I want to be at the origin at the end time, plus an integral of a path cost R, which can depend on the state and on the control; in the simplest case it depends only on the control, say quadratically. So if we start at some initial condition, we want to be as close as possible to the origin at the end time T, and we have to balance two terms. If we steer very hard, a large control gets us really close to the origin: the control term will be large and the end term small. If we barely steer, the control term will be small and the end term large, because we end up somewhere far away. So the optimum lies somewhere in between. And there is noise in the system, so the trajectory may take you here or there, and depending on where you end up at each time, your control will point in some direction. This optimal way to steer is the optimal control function u(x,t), which is the objective of the problem, the thing you want to get out of this exercise.

Now we have seen that the way to solve this is to introduce the cost-to-go. From any intermediate state x and any intermediate time t, we define a quantity J(x,t), which is the solution of the same control problem started at that intermediate time and state: we just replace 0 by t and solve the problem from there towards the end. This is a scalar quantity, a number depending on space and time. And this quantity satisfies a partial differential equation, the Bellman equation. As I argued at the start of yesterday's lecture, this equation can be viewed in discrete time as a map over time; taking the time step to zero gives the partial differential equation shown at the bottom of the slide. It is an equation for the scalar value J, expressed in terms of the noise, the drift term, and the path cost, with the end cost given as a boundary condition on J.
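To make this discrete-time view concrete, here is a minimal sketch, my own illustration on an assumed 1-D toy problem (not code from the lecture), of the backward Bellman recursion for dx = u dt + dW with cost φ(x_T) + ∫ ½u² dt:

```python
import numpy as np

# Minimal sketch (assumed 1-D toy problem): backward Bellman recursion for
#   dx = u dt + dW,   cost = phi(x_T) + integral of 0.5 u^2 dt.
dt, T, nu = 0.02, 1.0, 1.0            # time step, horizon, noise variance
xs = np.linspace(-3, 3, 121)          # state grid
us = np.linspace(-5, 5, 51)           # control grid
J = xs**2                             # initialize with the end cost phi(x)

# 3-point Gauss-Hermite rule for E[f(Z)], Z standard normal
nodes = np.array([-np.sqrt(3), 0.0, np.sqrt(3)])
wts = np.array([1/6, 2/3, 1/6])

for _ in range(int(T / dt)):          # run the map backwards in time
    EJ = np.zeros((len(xs), len(us)))
    for i, u in enumerate(us):
        for z, w in zip(nodes, wts):  # E[J(x + u dt + dW)] over the noise
            EJ[:, i] += w * np.interp(xs + u*dt + z*np.sqrt(nu*dt), xs, J)
    Q = 0.5 * us[None, :]**2 * dt + EJ
    J = Q.min(axis=1)                 # Bellman: minimize over the control

u_star = us[Q.argmin(axis=1)]         # optimal feedback at the earliest time
```

The minimizer of the right-hand side at each grid point is exactly the optimal control described above; the exponential growth of the grid with dimension is the curse of dimensionality that comes up next.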
So we initialize J at the end time: J(x,T) = φ(x). Then we run this partial differential equation backwards in time and get the solution for all times. That is the objective. And once we have that solution, the control u at each point is given by the minimization of the right-hand side: the right-hand side is a function of x and t, and if we find its minimizer at a point, we have found the optimal control at that particular point.

Now, you have to appreciate that this is, in general, a very difficult exercise. People do it, but in practice you can do this for simple problems with at most maybe five or ten dimensions, by discretizing space and time and solving the partial differential equation numerically. That is basically all you can do. Or you look at special classes of problems. In particular, the linear quadratic control problems are easily solvable: if the dynamics f is a linear function of x and u, and the cost terms φ and R are at most quadratic, then everything is nice and you get sufficient statistics; it is sort of the Gaussian case of control. J becomes a quadratic form, J = ½ xᵀPx + αᵀx + β, and all the time dependence of J sits in P, α, and β. With this ansatz, as always, the partial differential equation reduces to ordinary differential equations for P, α, and β. Since x may be an n-dimensional vector, P is an n×n matrix, so it can still be quite a large object, but it remains manageable even up to thousands of dimensions.

[Question: can this R be an arbitrary function of x, u, and t?] For the general setting, you have to discretize space and time and use finite element methods to solve the partial differential equation, and if you do that, it really does not matter how complex R is. Numerically solving, yes; analytically solving, no way. In the linear quadratic case, as I showed yesterday, you get the Riccati equations, which are significantly simpler but still cannot be solved analytically except in very special cases; I gave one example yesterday. But they are at least computationally tractable, whereas if you have to discretize the PDE you run into the curse of dimensionality: the number of grid points scales exponentially with the dimension of the problem. So that is the setting.
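For the linear quadratic case, a minimal sketch of integrating the Riccati equation backwards in time; the double-integrator system and the cost matrices are assumptions for illustration, not the example from the slides:

```python
import numpy as np

# Minimal sketch (assumed LQ example): backward integration of the Riccati ODE
#   -dP/dt = A'P + PA - P B R^{-1} B' P + Q,   P(T) = Qf,
# for dx = (Ax + Bu) dt + dW; the optimal control is u = -R^{-1} B' P(t) x.
A = np.array([[0.0, 1.0], [0.0, 0.0]])    # double integrator (assumption)
B = np.array([[0.0], [1.0]])
Q, Qf, R = np.zeros((2, 2)), np.eye(2), np.array([[1.0]])
dt, T = 0.001, 2.0
Rinv = np.linalg.inv(R)

P, Ps = Qf.copy(), [Qf.copy()]
for _ in range(int(T / dt)):              # step from t = T back to t = 0
    P = P + (A.T @ P + P @ A - P @ B @ Rinv @ B.T @ P + Q) * dt
    Ps.append(P.copy())
Ps = Ps[::-1]                             # Ps[0] is P at time 0

x0 = np.array([1.0, 0.0])
u0 = -Rinv @ B.T @ Ps[0] @ x0             # feedback control at time 0
```

Only the n×n matrix P (plus α and β in the general case) has to be propagated, which is why this stays tractable in high dimensions.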
OK, so to make progress, we now go into the class of path integral problems, where we can make some headway. Whereas before the dynamics was dx = f(x,u,t) dt plus noise, we now split it: we write dx = f(x,t) dt + g(x,t) u dt + g(x,t) dW(t), with the same function g multiplying the control and the noise. That is to say, we split out the u dependence: the dynamics are linear in u, but u can be multiplied by an arbitrary function. The state x may be n-dimensional and u may be m-dimensional, so g is in general an n×m matrix: an n×m matrix multiplying an m-dimensional vector, and the same holds for the noise term. So that is the first simplification.

The second simplification is that in the cost function we also peel out the control-dependent part. The path cost R, which before was arbitrary, we now write as a term V(x,t) that depends only on the state and time, plus a term ½ uᵀRu that is quadratic in the control. In a sense, if you put the control equal to zero, this is still an arbitrary dynamical system, because f is arbitrary. But the control acts on it in a restricted way: it has to act additively, and it acts on the dynamics in the same way as the noise does. That is the particular assumption made here.

If we put this dynamics into the Bellman equation I just showed you, the cost term becomes V + ½ uᵀRu, the drift f becomes f + g u, and the noise term becomes the covariance of the noise: the variance of g dW is of the form g ν gᵀ, where ν is the covariance of dW. So we get this kind of term here, and we still have to solve with the same boundary condition; φ and V are arbitrary, and R is an m×m matrix acting on the m-dimensional controls. But now the equation is quadratic in u and linear in u, so we can do the minimization over u explicitly. [Question: is that a bracket? Yes, there is a bracket there.] The u-dependent part is ½ uᵀRu + (∇J)ᵀ g u. Taking the derivative with respect to u gives Ru + gᵀ∇J = 0, so the solution is u = −R⁻¹ gᵀ∇J. Let me be careful with the transposes here: ∇J, the gradient with respect to space, is an n-dimensional column vector, and g is n×m, so it is gᵀ∇J, an m-dimensional vector, that R⁻¹ multiplies.
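As a worked equation, the minimization just carried out, in the notation of the slides:

```latex
\min_u \left[\tfrac{1}{2}\,u^\top R\,u + (\nabla J)^\top g\,u\right]
\;\Longrightarrow\;
R\,u + g^\top \nabla J = 0
\;\Longrightarrow\;
u^*(x,t) = -R^{-1}\, g(x,t)^\top\, \nabla J(x,t) .
```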
So this is the solution we find for u, and we can plug it back into the equation: we replace u by this expression, and of course we get a rather ugly expression. Let me do it. Take the ½ uᵀRu term and fill in the solution: the two minus signs cancel, and we get ½ (∇J)ᵀ g R⁻¹ R R⁻¹ gᵀ ∇J; the R and one R⁻¹ collapse, so this becomes ½ (∇J)ᵀ g R⁻¹ gᵀ ∇J, which is the first term over there. Then the V term: nothing happens. The (∇J)ᵀf term: nothing happens. Then we have to compute the (∇J)ᵀ g u term: filling in u, this becomes −(∇J)ᵀ g R⁻¹ gᵀ ∇J. So we get the same expression once with a factor +½ and once with −1, and adding them gives the −½ term in this expression. And we are left with the untouched terms, the ones that do not depend on u: V, the term multiplying f, and the diffusion term.

OK, so now we have this very ugly expression, and now the trick happens. We define ψ through a log transform: we express J as −λ log ψ, where λ is a positive number; this is the definition of ψ. And we fill this in. Let me do it a bit sloppily. The gradient of J gives −λ (1/ψ) ∇ψ. The second derivative of J gives two terms: +λ (1/ψ²) ∇ψ∇ψᵀ, and −λ (1/ψ) times the second derivative of ψ. So after the substitution, the first term, the one with −½ and R⁻¹, gives something quadratic in the gradient of ψ, and the diffusion term gives two contributions: one quadratic in the gradient of ψ and one with the second derivative of ψ. The first quadratic term is proportional to R⁻¹ and the other to ν. Now you can show that these two quadratic terms cancel, so that the only surviving term of this kind is the diffusion term with the second derivative of ψ, but only if the matrix R is chosen in relation to the noise covariance ν. The cancellation occurs exactly when ν is proportional to the inverse of R, with proportionality constant λ: that is, λR⁻¹ = ν, or equivalently R = λν⁻¹.
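Collecting the board algebra in one place; this is my reconstruction, and it matches the standard path integral control derivation:

```latex
% Log transform and the cancellation condition:
J = -\lambda \log \psi, \qquad
\nabla J = -\lambda\,\frac{\nabla\psi}{\psi}, \qquad
\nabla^2 J = \lambda\,\frac{\nabla\psi\,\nabla\psi^\top}{\psi^2}
           - \lambda\,\frac{\nabla^2\psi}{\psi}.

% The two terms quadratic in \nabla\psi,
-\tfrac{1}{2}(\nabla J)^\top g R^{-1} g^\top \nabla J
  = -\frac{\lambda^2}{2\psi^2}\,\nabla\psi^\top g R^{-1} g^\top \nabla\psi
\quad\text{and}\quad
\frac{\lambda}{2\psi^2}\,\nabla\psi^\top g\,\nu\,g^\top \nabla\psi
\;\;\text{(from the diffusion term)},

% cancel exactly when
\lambda R^{-1} = \nu
\;\Longleftrightarrow\;
R = \lambda\,\nu^{-1}.
```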
[On the question from before: yes, it is about column vectors and row vectors; a vector multiplying a matrix has to go the other way around, so it is gᵀ times the gradient. Is that your question? Yes. OK.] You can find the full derivation on the slide, written out in components for the high-dimensional case, and you can go through it; it is not very enlightening, but it is tedious, and you can do it. The end result is that the term that is nonlinear in J can be made to cancel, to fall out, if you make this transformation and assume that this relation holds; I come back in a minute to what that relation means. If you do this and collect all the terms, everything becomes linear in ψ. So, and I don't think it is particularly enlightening to go through the whole derivation, if you just do this, follow this, and assume this, what comes out is the linearized Hamilton-Jacobi-Bellman equation. It is still a partial differential equation, but the min over u is gone: we have solved for that, we have done the minimization, so u is out of the problem. It is still a high-dimensional partial differential equation, but it is linear in ψ, and that is good news. And it has boundary conditions as before: the boundary condition on J translates into the boundary condition ψ(x,T) = e^{−φ(x)/λ}.

So this is the linear Bellman equation, linear in the sense that it is linear in ψ, and now we can interpret it. I remind you of yesterday's lecture, where we had the Kolmogorov backward equation; I also used the notation ψ there, deliberately, to highlight the connection. If we think of a stochastic process given by a conditional probability to go from x at time t to z at time T, and identify ψ with the dependence on the first arguments, which is what the backward equation describes, then this is in fact the backward equation of a stochastic process. Going back to the slides, we already recognize the drift term and the diffusion term. But there is a new term here that we had not seen before, one that multiplies ψ itself, with no gradient at all. What does this term do? Let me just give the answer. The process this corresponds to consists of a drift and a diffusion: the drift makes the process move, the diffusion spreads it out. It is in fact, sorry for the different notation, the same as our original process with u set to zero: we keep the f drift and the g dW noise. Those are two parts of this equation. And apparently this runs in parallel with something that applies −V/λ at each time step, and that is a killing process: something that destroys probability. So far, all the diffusion processes we have seen conserve probability, the integral.
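Written out, the equation we end up with reads as follows (my reconstruction from the spoken derivation, matching the standard linearized form):

```latex
% The linearized Bellman equation and its boundary condition:
-\partial_t \psi(x,t)
  = -\frac{V(x,t)}{\lambda}\,\psi(x,t)
  + f(x,t)^\top \nabla \psi(x,t)
  + \tfrac{1}{2}\operatorname{Tr}\!\left[\, g \nu g^\top\, \nabla^2 \psi(x,t) \right],
\qquad
\psi(x,T) = e^{-\phi(x)/\lambda}.
```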
So far, for the probability ρ(x,t | x₀,t₀) of an ordinary diffusion, we had this picture: at the initial time a very sharp distribution, at later times a broader one, but if we integrate over all x we find the integral equals one, and this is true for all t. The distribution is always normalized, a Gaussian getting wider. By introducing this extra term, the killing of probability, that is no longer true: probability gets destroyed (or, with the other sign, created). The way to model this stochastically is to run the two ordinary terms of the stochastic differential equation in parallel with the killing process. If we simulate a population of particles, then at each moment of time each particle is killed with infinitesimal probability V dt/λ. That is the process being described here; and similarly the end cost enters at the end time, with φ over λ.

As we have seen before, we have this trinity of descriptions of a stochastic process: the stochastic differential equation, the Kolmogorov backward equation, and the Fokker-Planck forward equation. The same holds in the presence of the killing process: here we have the backward equation, here the stochastic description, and there is also a forward description, a Fokker-Planck equation which is the same as before except for this extra killing term. If you worry about the normalization issue, look at the time derivative of ∫ρ dx: integrating the right-hand side over all space, we get the integral over x of minus (V/λ)ρ, minus the divergence of fρ, plus one half the trace term with g ν gᵀ ρ. The drift and diffusion terms are total derivatives, so integrating them over all space gives boundary terms, and since the probability falls off at the boundaries, these integrals are zero. But the killing term is not a derivative of anything, so it does not conserve probability: the total probability mass changes over time. That is what is happening.
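As a worked equation, the forward description with killing and the normalization computation just sketched:

```latex
% Fokker-Planck equation with killing, and the normalization leak:
\partial_t \rho
  = -\frac{V}{\lambda}\,\rho
  \;-\; \nabla\!\cdot\!(f\rho)
  \;+\; \tfrac{1}{2}\sum_{ij}\partial_i\partial_j\!\left[(g\nu g^\top)_{ij}\,\rho\right],
\qquad
\frac{d}{dt}\int \rho\,dx \;=\; -\int \frac{V}{\lambda}\,\rho\,dx \;\neq\; 0 .
```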
Funny enough, we have now changed the direction of time. The Bellman equation had to be solved with an end condition, going backwards in time; we have identified it with the Kolmogorov backward equation, which corresponds to a stochastic process that runs forward in time, and for which we have a Fokker-Planck description. So we have changed the direction of time. And we can go further. Since this equation is linear, there is a famous theorem by Feynman and Kac that gives you the closed-form solution in terms of a path integral, and that is the following statement.

Suppose we introduce a distribution over trajectories, the trajectories generated by this stochastic process (sorry for the clutter of notation: this dξ is the same as the dW we had before). All these trajectories start at the same state and at the same time. So we get different runs, and the picture is that you start here and get these different trajectories; each run is a trajectory, and together they give a distribution over trajectories, which we call Q, Q(τ), with τ denoting a trajectory. Then you can show the following: if you define ψ as the integral over all trajectories under this measure, weighting each trajectory with e^{−S(τ)}, where S is the end cost plus the state-dependent path cost, then ψ is a solution of this Bellman equation with the right boundary conditions. Remember that the control problem had three cost terms; this S is just the first two, the end cost and the state cost, and does not include the quadratic control term. So that is the big trick. I am not going to show why it holds; it is quite a lengthy derivation, but it is very much a cornerstone of this whole idea.
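A minimal sketch of this Feynman-Kac sampling statement, on an assumed 1-D toy problem of my own (the drift, costs and parameters are illustrations, not the lecture's example):

```python
import numpy as np

# Minimal sketch: estimate psi(x0, t0) = E_Q[ exp(-S(tau)/lambda) ] by
# sampling uncontrolled trajectories and weighting them with exp(-S/lambda).
rng = np.random.default_rng(0)
dt, T, lam, nu = 0.01, 1.0, 1.0, 1.0
n_steps, n_traj = int(T / dt), 5000

f = lambda x, t: -x               # uncontrolled drift (assumption)
g = lambda x, t: 1.0              # noise/control gain, scalar case
V = lambda x, t: 0.0              # state-dependent path cost
phi = lambda x: (x - 1.0) ** 2    # end cost: be near x = 1

def psi_estimate(x0, t0=0.0):
    x = np.full(n_traj, x0)
    S = np.zeros(n_traj)
    for k in range(n_steps):
        t = t0 + k * dt
        S += V(x, t) * dt                         # accumulate path cost
        dW = rng.normal(0.0, np.sqrt(nu * dt), n_traj)
        x = x + f(x, t) * dt + g(x, t) * dW       # uncontrolled dynamics
    S += phi(x)                                   # add the end cost
    return np.exp(-S / lam).mean()

print(psi_estimate(0.0))   # the cost-to-go is then J = -lam * log(psi)
```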
[Question about simulating the killing process.] Yes. If you think of a diffusion evolving under the Fokker-Planck equation, you are evolving a conserved quantity; in this case it is not conserved. Normally, if there is no killing and you start with n particles, you have n particles at all times. If there is killing, you lose particles, and at some point you may run out of particles to do your statistics, so you have to generate new particles at intermediate times. Note that V depends on the state: in this state, for instance, the probability to get killed is much higher than in that state, so you get an imbalance in the killing of the particles. But you could add a global constant to V, or subtract one, and changing this global constant does not change the physics, does not change the control problem at all, because adding a state-independent constant to the original cost does nothing. In the killing formulation you see the same thing: the overall level, even the sign, is in a sense arbitrary; only the relative value of V here versus there matters. So if at each time you start with 100 particles and in the first time step you kill, say, 10%, you can generate another 10% to keep the total number of particles constant; there are some tricks to do this regeneration. [How, exactly?] Essentially, you make an empirical distribution of the particles you have at this time, and then you generate n particles from that empirical distribution. You say: on my state space I have a particle here, here, and here; the rest are all killed; and I generate new particles at these locations, I just make copies at the states where particles survive. That is how you replenish your set of particles. It is the same thing that happens in particle filtering in signal processing, where it is called resampling. OK, so this gives us a solution: we weight each trajectory with this factor.
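A minimal sketch of that resampling step, in the particle filtering style just mentioned (the survivor values are made up for illustration):

```python
import numpy as np

# Minimal sketch: replace a killed/weighted particle set by n fresh copies
# drawn from the survivors' empirical distribution (multinomial resampling).
rng = np.random.default_rng(1)

def resample(particles, weights, n):
    """Draw n particles i.i.d. from the weighted empirical distribution."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                       # normalize surviving weights
    idx = rng.choice(len(particles), size=n, p=w)
    return np.asarray(particles)[idx]     # copies of surviving particles

# Example: three survivors out of an original population of 100.
survivors = np.array([0.2, 1.3, -0.7])
print(resample(survivors, [1.0, 1.0, 1.0], n=100))
```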
So that is one way to get to path integral control. Now I want to give you another way to derive the same result, and it is the following. Suppose I am in a discrete space: I start here, and I want to get to the gold, or to the diamond in this case, while avoiding obstacles. That is my control problem. I start in an initial state, and I have the rules of the game, which are my world: from this position I can move right, and maybe diagonally too, but that gives a high penalty; there is some North, East, South, West kind of dynamics, the rules of what I can do. From each state I can go to a number of other states. Now suppose I make a diffusion process that just imitates these rules of the game. As another example, think of chess: the rules of the game say that the rook can make certain moves and the pawn can make certain moves, et cetera, and I can play random dynamics, random futures of this game, which gives me trajectories. This diffusion process I denote by Q, the same Q I was talking about before. This is the so-called uncontrolled dynamics: the dynamics I get if I just make random moves, the same as in the formulation I started with if I set the control to zero; I just get Brownian motion. So I identify Q with the distribution over these uncontrolled trajectories. Now I define a cost, consisting of a path cost and an end cost, so for each trajectory I get a cost. And I want to find a distribution over trajectories, different from Q, that minimizes this expected cost while at the same time staying close to my uncontrolled dynamics. That is the objective: the distribution is sort of a Gaussian shape, and I want to move that shape a little bit, not too much, but in the direction that gives a good expectation value of S. For instance, if S is only an end cost, favoring positions here at this point, then the red trajectories are the ones with a low expectation value of S, and the other penalty is that you want to stay close to Q.

The way to formalize this is to define a cost which is a function of a distribution: I am optimizing over distributions P(τ) over trajectories, and I minimize two terms. One is the expected S, and the other is the KL divergence, the distance between P and Q, which is ∫ P log(P/Q), where the stochastic variable is the trajectory itself; we integrate over all trajectories. In this formulation, the optimization over P is very easy, so easy that I will do it for you here on the board. We have C = ∫dτ P(τ) log[P(τ)/Q(τ)] + ∫dτ P(τ) S(τ); this is the whole cost function we have to optimize, with the expectation of S as the second term. Taking the derivative δC/δP(τ) is easy. The first term gives log[P(τ)/Q(τ)], plus P(τ) times the derivative of the log, which is just one over P(τ), so plus one; and the second term gives S(τ). Ah, there is one thing I have not told you: of course, I have to do this optimization under the constraint that the distribution is normalized, ∫dτ P(τ) = 1. So I add a Lagrange multiplier term, +λ(∫dτ P(τ) − 1), and taking the derivative then gives an additional λ. Setting the whole thing equal to zero, it is very easy to see that this gives just a constant, and the solution is P(τ) ∝ Q(τ) e^{−S(τ)}. The proportionality is fixed by the normalization condition, and the normalization constant is ∫dτ Q(τ) e^{−S(τ)}. Ha, but that thing we have already seen: that is ψ, the ψ we defined in the Feynman-Kac formula; it is here on the slide. Just to remind you, P(τ) depends on the initial condition, say x and t₀, and so all these quantities depend on it; this ψ depends on the initial condition as well, because the trajectories start there. So we find a solution of this form, with the normalization constant being exactly the ψ from the Feynman-Kac formula, the quantity whose logarithm we previously identified with the optimal cost-to-go.
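The board derivation as one worked equation (note that λ here is the Lagrange multiplier for normalization; the temperature was set to one, as stated in a moment):

```latex
% KL control objective with a normalization constraint:
C[P] = \int\! d\tau\, P(\tau)\log\frac{P(\tau)}{Q(\tau)}
     + \int\! d\tau\, P(\tau)\,S(\tau)
     + \lambda\!\left(\int\! d\tau\, P(\tau) - 1\right),

\frac{\delta C}{\delta P(\tau)}
  = \log\frac{P(\tau)}{Q(\tau)} + 1 + S(\tau) + \lambda = 0
\;\Longrightarrow\;
P^*(\tau) = \frac{Q(\tau)\,e^{-S(\tau)}}{\psi},
\quad
\psi = \int\! d\tau\, Q(\tau)\,e^{-S(\tau)} .
```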
Namely, we have the relation J(x,t) = −λ log ψ(x,t). And it is very easy to see that if we take this optimal solution and put it into the cost, some terms cancel; well, I am not going to do it, but if you put it in, what you find is that the optimal cost is exactly minus the log of ψ. That is the same statement as the one here: the optimal cost-to-go is minus the log of ψ, where I have put λ equal to one in this derivation.

Ah, I should tell you something about this relation, so let me do that now. There was the relation that λ = Rν. That is to say: R is the matrix that appears in the cost, the ½ uᵀRu term, and ν appears in the covariance of the noise, dWᵢ dWⱼ = νᵢⱼ dt; that is the covariance of the noise in the stochastic process. These two are a priori completely unrelated, but in this path integral formalism they have to be related in such a way that one is proportional to the inverse of the other. So what does that mean? Suppose you have a two-dimensional system, R is diagonal with entries R₁ and R₂, and the noise ν is diagonal with entries ν₁ and ν₂, with zeros elsewhere. The condition then says that Rν is the diagonal matrix with entries R₁ν₁ and R₂ν₂, and the left-hand side has to be λ times the identity. So we need λ = R₁ν₁ = R₂ν₂, and we can only do that if these variables are related in this way: Rᵢ = λ/νᵢ. So keep λ fixed and let the noise covariance in one direction go to zero: the control penalty Rᵢ in that direction goes to infinity, and the control uᵢ we get as the solution goes to zero, because it gets infinitely penalized. In directions where νᵢ goes to zero, Rᵢ goes to infinity and uᵢ goes to zero. So directions with no noise can also carry no control. That is the limitation of this path integral formalism: there is a built-in balance between how noisy directions are and the extent to which they can be controlled, with the extreme that a direction with no noise cannot be controlled at all.
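As a worked equation, the diagonal two-dimensional example of this compatibility condition:

```latex
% Diagonal 2-D example of the condition R \nu = \lambda I:
R = \begin{pmatrix} R_1 & 0\\ 0 & R_2 \end{pmatrix},\quad
\nu = \begin{pmatrix} \nu_1 & 0\\ 0 & \nu_2 \end{pmatrix},\quad
R\nu = \begin{pmatrix} R_1\nu_1 & 0\\ 0 & R_2\nu_2 \end{pmatrix}
     = \lambda I
\;\Longrightarrow\; R_i = \frac{\lambda}{\nu_i}:
\qquad \nu_i \to 0 \;\Rightarrow\; R_i \to \infty \;\Rightarrow\; u_i \to 0 .
```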
As a particular example, take a second-order system, z̈ = u, like the controlled spring system we had before, and write it in two variables, z = x₁ and ż = x₂, so we get the system dx₁ = x₂ dt and dx₂ = u dt. From the philosophy we have just developed, if we want to make this a stochastic system within the path integral framework, we can add noise in the second equation, next to the u dt, but we cannot add noise in the first equation, because then we would also have to add control there. So this is a particular limitation of the path integral control case. Another example: take two particles, dx₁ = u₁ dt + dW₁ and dx₂ = dW₂. So one particle is under my control and has noise, and the other particle is just Brownian motion, not under my control. Then this is not an instance of the path integral class, because I have no control over that noisy direction; it falls outside the path integral control class. Any questions there? [Question.] Yes, I am saying this is a deterministic system, there is no noise. If I now want to make it stochastic, I can add noise here, but I cannot add noise there, because I have no control there: you cannot write it in the form I started off with. Adding that noise would mean I also have to add a control, which I don't have. They go hand in hand, that is the point. So this is fine, this is path integral control, but you cannot add the noise there.

Maybe you are referring to this. The story so far is quite general, and the cost now consists of two terms, the expected S and the KL term. Previously, the cost consisted of the φ and V terms and the quadratic ∫u² term. And you can show that if you go to the special case where these controlled diffusions are given by this stochastic differential equation, so that P corresponds to trajectories of this diffusion process and Q to trajectories of the same diffusion with u = 0, and you compute the KL divergence for these particular P and Q, it turns out to be exactly that term: it recovers the quadratic u term we had before, so it is all the same. And the λ condition comes out automatically: you get λν⁻¹ as the value of R. [Question.] The KL optimization, yes: quadratic in u, but the state cost can be an arbitrary function of the state, and the dynamics can be arbitrarily nonlinear; it is just quadratic in u. Is that your question? [Question.] No: in the case of Brownian noise, the KL formulation reduces to a term quadratic in u. For other noise processes, the corresponding control cost term will not be quadratic in u; it will be something else. If you apply this formalism to stochastic dynamics with non-Gaussian noise, say a discrete state space with some sort of Poisson noise, the KL cost implies a particular other control cost term. That was your question.
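For completeness, the identity behind the claim that the KL divergence reduces to the quadratic control cost; this is my sketch of the standard Girsanov computation, which was not written out in the talk:

```latex
% Density of the controlled path measure P_u w.r.t. the uncontrolled Q,
% and the resulting KL term (using R = \lambda \nu^{-1}):
\mathrm{KL}(P_u \,\|\, Q)
  = \mathbb{E}_{P_u}\!\left[\log\frac{dP_u}{dQ}\right]
  = \mathbb{E}_{P_u}\!\left[\frac{1}{2}\int_t^T u^\top \nu^{-1} u\, ds\right]
  = \frac{1}{\lambda}\,
    \mathbb{E}_{P_u}\!\left[\int_t^T \tfrac{1}{2}\, u^\top R\, u\, ds\right].
```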
OK, so now we have the optimal cost-to-go. We can get it as minus the log of this ψ, and ψ is given as an integral like this, which can be sampled. We sample trajectories under Q, the uncontrolled dynamics, the dynamics in the absence of u; we weight all these trajectories with e^{−S}, and we get an estimate of the optimal cost-to-go. And we have solved our control problem without ever solving the Bellman equation. So we can solve the control problem as a sampling problem, and that is good news, because sampling typically scales better in high dimensions than some finite element scheme does. So this is good.

You can actually also compute the optimal control itself. I will not derive it, it is given by this formula, but I can give a little bit of a hint. u was given as the gradient of J, and J is minus the log of ψ. Now think of ψ as a partition sum: then log ψ is like a free energy, and if you take the derivative of a free energy with respect to something, in this case the coordinates, you get expectation values out. What comes out, and it is too involved to go into here, is this formula: the optimal control at a certain point is given by this expectation value, where dW is the first step of your noise. Let me draw it here. Suppose I start here and want to reach the origin at the end time; I make one noisy trajectory that goes like this and ends up here, and another that does this and goes up like there. Now, the expected value of the first noise increment dW is of course zero; it is a zero-mean Gaussian. But if you weight this first step with the cost of the whole path it generates, that changes: if your first step goes a little bit in the right direction, you have a slightly larger chance of reaching the goal than if it goes in the wrong direction. So trajectories that start in the right direction get weighted higher than trajectories that start in the wrong direction, and this expectation value is no longer zero. And the miraculous thing is that the value you get out is exactly the optimal control. So: I start here, there is a pot of gold there and nothing there; I draw random samples; some go this way and end up at the pot of gold, some go that way and end up with nothing. My first step is weighted by the final cost, the ones pointing towards the gold get slightly more weight, so my mean tilts slightly in that direction, and the statement is that this bias is in fact the optimal control at that point. So you can sample a solution for the optimal control as well as the optimal cost-to-go. [Question: and what if you then change the sampling?] I will get to that; the rest of the talk will be much about that.
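A minimal sketch of this weighted-first-noise-step estimator, again on an assumed 1-D toy problem (a pot of gold at x = 1), in the same conventions as the ψ estimator above:

```python
import numpy as np

# Minimal sketch: path integral control estimate for dx = u dt + dW, nu = 1,
#   u(x0, t0) dt  ~  E[ dW_0 e^{-S/lam} ] / E[ e^{-S/lam} ],
# where dW_0 is the FIRST noise increment of each uncontrolled rollout.
rng = np.random.default_rng(2)
dt, T, lam, nu = 0.01, 1.0, 1.0, 1.0
n_steps, n_traj = int(T / dt), 20000
V = lambda x: 0.0                    # state path cost (assumption)
phi = lambda x: (x - 1.0) ** 2       # pot of gold at x = 1 (assumption)

def u_estimate(x0):
    x = np.full(n_traj, x0)
    S = np.zeros(n_traj)
    dW0 = None
    for k in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(nu * dt), n_traj)
        if k == 0:
            dW0 = dW                 # remember the first noise step
        S += V(x) * dt
        x = x + dW                   # uncontrolled rollout (zero drift)
    S += phi(x)
    w = np.exp(-S / lam)
    return (w * dW0).sum() / (w.sum() * dt)   # weighted mean of dW0, per dt

print(u_estimate(0.0))   # should point toward the goal at x = 1
```

Note that the same rollouts serve both estimates: the mean of the weights gives ψ, and the weighted mean of the first noise increment gives the control.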
So here is a recap, a somewhat simplified slide. Suppose I want to go from the origin to either one of these two goals; this is my dynamical system, and this is the cost I want to minimize. If I am not in the path integral control class, what I typically do is solve the Bellman equation, a partial differential equation, and the solution looks something like this: this axis is time, this is space, and the color means the height of J. J is very low here, because here I am on the target, and very high there, because there I am very far from the target. So this is the solution J of the partial differential equation. A trajectory would then look something like this: there is noise in the system which is not under control, only the control enters here, but you can compute the optimal control from the solution, and the trajectories are very noisy but get more or less to the goal. That is the kind of picture you would have, that is what we have been talking about, and that is the standard way.

Now, the path integral way is to say: I consider systems where the dynamics split into a state-dependent part and a part linear in the control, and my cost splits the same way, which gives me this S plus the quadratic control term. We have seen before that this quadratic term is exactly the KL divergence between the controlled and uncontrolled path distributions, and therefore we get this identity with the KL formulation. If we solve that, we get the solution I just pointed out: P* is proportional to Q times the exponential weight. So we get two types of trajectories. Here are optimally controlled trajectories, all weighted equally: I have the optimal control computed, I start a trajectory and I paint it, these are all optimal, and I get a distribution over optimally controlled trajectories; those are these. And here I get grayscale trajectories: trajectories generated from the uncontrolled dynamics, which go all over the place, that is this gray area, but they are weighted with e^{−S}; the coloring here is by e^{−S}, and you see that some of these trajectories are weighted high and some low. The miraculous statement is that these two ensembles are the same: the two distributions are the same, in the probabilistic sense. The unweighted trajectories from the optimal control give the same cloud of trajectories, the same distribution, as the trajectories from the uncontrolled dynamics weighted by this factor. OK, so this is the essence of the path integral control method, and it is known in the mathematics literature as Girsanov's theorem.
OK, so we can compute it by sampling. Let me give you an example, the same example as before. Now I look at the solution J, which I already showed you, but I take cuts at different times. What you see is the following: for large time-to-go, long in the past, J has a convex shape, and since the control is the gradient of J, you are steering towards the middle. Then at some point you get a symmetry breaking: the shape tilts over into two wells, and then you actually have to make a choice. If you are right in the middle, you have to fall either to the left or to the right. So the decision of which of the two targets to go to is here a dynamic process: early in time you say, well, I am going to delay my decision, I will just steer to the middle. At some point this is no longer good enough, because you would end up in the middle, and you don't want to end up in the middle, so at some point you have to decide left or right. But it is not optimal to make that decision as early as possible, and that is a curious fact, known as delayed choice. You start at the origin, and instead of saying, well, my first step kicked me slightly up, so I will go to this target, you say: hold on, just stay in the middle, let's see what happens, let's see where the drift takes me. If the drift takes me to the top, fine, I go there; if it takes me to the bottom, I go there; I have time enough to decide, I don't have to decide now, and I should not spend control effort steering prematurely, because future noise may destroy whatever I have controlled already. Another way to look at this: there are these two targets, and I want to steer to one of them. If my noise cone, the width of this diffusion process, is wide enough to encompass both targets, it is likely that I will get to one of the two, so everything is fine. But as I move closer, the gap between them no longer fits inside the cone, and then I have to make a decision. So if the noise cone is very narrow, you have to make the decision very early in time, and the wider the noise cone, the later you can make it. That is to say: the larger the noise in your problem, the more you can delay your decision. With a very deterministic problem you have to decide very early, and the more noise you add, the later you can decide. I always make a joke at this point, that this reminds me of Christianity in Europe, where we actually have two solutions: the high temperature solution in the south and the low temperature solution in the north. In the south we have Catholics and in the north Protestants, and both of them act optimally. The low temperature solution is where I am from; I am from the north, my parents were Protestants, and my mother always told me: don't leave for tomorrow what you can do today.
And that is the low temperature, low noise control solution: do it as early as possible. And of course you know the southerners have another approach to this, particularly in Spain, with the famous saying mañana: well, we'll do it another day. They have the high temperature solution, which is to delay the choice longer, because there is more noise in their system, so to say. So this is how it works out in real life. You also see it here in football, a wonderful example of delayed choice: very sophisticated football players can plan far into the future, and they appreciate that the problem is uncertain, the ball can go anywhere, so they spread themselves out over the field. The beginners, of course, all run straight at the ball; they don't delay any choice, they just go right for the thing. [Question about the cost in this picture.] In this one, yes: this simulation is actually the same as that one, V is zero, and φ is just these two end points. [Question about when the decision is made.] Yes, in this problem the decision time is in fact one over the noise; it is somewhere on the slide, I think. The time-to-go here is 2 − t, and the decision is made when the time-to-go equals one over the noise. So if the noise goes to infinity, that time-to-go goes to zero: you make the decision at the very end. And if the noise goes to zero, it goes to infinity: the decision moves infinitely far into the past. If you start at time zero, you only have a time-to-go of two, so it depends on whether this two is larger or smaller than one over the noise. OK, these are some details.

OK, this is a video of this control applied to a number of drones. The problem is that I have a bunch of drones, about ten, and they fly around here. This is a realistic simulator, with noise, turbulence, and a noisy GPS signal from which you get the positions of the drones. The task is that they have to stay close to the central pole, maintain a minimal velocity, and not bump into each other; for the rest, they can do whatever they want. The way this is modeled is that each drone is modeled as a point mass, and the model is updated several times per second with the recorded positions of the drones. It is a centralized control solution: there is one model containing the positions and velocities of all the drones, the path integral simulation is run into the future with about 10,000 trajectories, and from that the optimal control is computed for all ten drones. These controls are sent to the drones, they update their positions and velocities, and this is repeated about three times per second; that is the movie you see. So here you see them fly around, and at some point they settle into a configuration where they form a pattern, in this case this circular pattern, after some time.
The pattern they form asymptotically depends very much on the amount of noise you put in the problem, and also on the number of drones; you can also get solutions with two concentric circles, but in this case you find one circle. So that is the pattern being established. The second simulation is a cat-and-mouse scenario: four drones are cats and one drone is the mouse. The only thing the mouse wants is to get away from the cats, so it always computes a direction away from them. The four cats are controlled in the same way as before, and their task is to catch the mouse, and you see that they do this quite well. To do that, they have to coordinate their actions: if one goes to the left and another to the right, they have to close in together, and they compute this coordinated task over a future horizon, in this case a horizon of two seconds. They really have to plan ahead to get a successful strategy. If we reduce the horizon to one second, the cats become sufficiently short-sighted, actually stupid, that they do not think far enough ahead and the control fails. So this planning into the future is really needed to get this coordinated behavior. [Question about the mouse.] The mouse is just trying to get away from the cats: it computes a resultant of the repulsive forces from each of the cats and moves away from that. It is not planning; no planning at all in the mouse, just a simple strategy of getting away. [Question.] No, that is not here; this is just collaborative control. You could of course do that. OK. A Chinese collaboration actually did this with real drones, so they got it to work, but they never really got to publish it, so I have to rely on this picture; I'm not sure about that.

OK, so the last part of the talk is about importance sampling. I told you that, miraculously, I sample this path integral and get an estimate of the control, but of course the story is more complex than that. Suppose that in this control situation I want to go from the origin, through one of these holes, and then maybe back to the origin again. If I sample the naive uncontrolled dynamics, the trajectories all run into the wall. The wall has infinite cost, and e to the minus infinity gives these trajectories weight zero, so they are basically absent, and only these three trajectories are left to base my statistics on. So I get very, very poor statistics. It would be much nicer if I could sample something like this, but then I am sampling from the wrong distribution, not my uncontrolled dynamics, so I am doing something wrong. What you can do instead is called importance sampling. For those who don't know it: suppose you have a Gaussian distribution here, and you want to estimate the probability that x is less than zero.
The naive thing to do is to say: I sample from my Gaussian distribution, I count the number of times I get an x less than zero, and I take the fraction, and that fraction is my estimate of the probability that x is less than zero. It is unbiased, it is correct, but it is not particularly efficient, because all the samples with x larger than zero, as in this case, just contribute zeros to my estimate; they don't count anything. Could I do better? Yes, of course, and that is importance sampling. I put a distribution with a little more mass on the left: instead of sampling from the blue distribution, I sample from, say, the green distribution p. In writing down the probability that x is less than zero, I multiply and divide by p, my green distribution, and now I can sample from p and consider this ratio the function I want to evaluate. So I generate samples from p and compute, for each of these points, the statistics of this ratio. It has the correct expectation value, and it has a variance, which, well, we will come to that.

So you can ask yourself: what is the optimal distribution to sample from? Some distributions are clearly better than others. Any ideas? Who knows what the optimal distribution is in this case? [A Gaussian?] Which Gaussian? I have two Gaussians here already. [A Gaussian with its mean at the intersection point?] [A broader Gaussian?] Any other ideas? [A bimodal one?] Why would you want two modes? No, the correct answer is not there yet; these answers are all incorrect. This is the optimal distribution: you look at your original problem and take the integrand of the thing you want to compute, as a distribution, normalized. So you take the product of your Gaussian and your objective function; in this case the objective function is zero here and one there, so you keep only this shape, and you renormalize it to norm one. In other words, p*(x) = q(x) I(x)/A, where I is the indicator function and A is the normalization. But the normalization A is exactly the quantity we wanted to compute, so we don't know A, and we don't know this distribution. Still, it is the optimal one. Why is it optimal? Take this distribution and put it into the formula: replace p by qI/A, and you see that the q and the I cancel, and the one over A becomes A. Do you see that? For each sample xᵢ, the q and the I cancel and I get, in fact, A. So I am sampling from this distribution, I get all kinds of different x's, but every time the estimate I make is the same number, A. I only have to sample once, because if I repeat, I get the same number; there is no need to do it twice. With one sample I get the correct answer, and I have no variance: I will always get the same solution.
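A minimal sketch of this Gaussian example; the standard normal proposal is my own choice for illustration:

```python
import numpy as np

# Minimal sketch: estimate A = P(x < 0) for x ~ N(mu, 1), naively and with
# importance sampling from a proposal with more mass on the left (assumption).
rng = np.random.default_rng(3)
mu, n = 2.0, 100_000
q = lambda x: np.exp(-(x - mu)**2 / 2) / np.sqrt(2 * np.pi)  # target N(mu, 1)
p = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)         # proposal N(0, 1)

x = rng.normal(mu, 1.0, n)
naive = np.mean(x < 0)                      # most samples contribute zero

xp = rng.normal(0.0, 1.0, n)
is_est = np.mean((xp < 0) * q(xp) / p(xp))  # reweighted proposal samples

print(naive, is_est)   # both unbiased; the second has much lower variance here
```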
So that's why it's optimal. But of course it's unrealistic, because it requires knowledge of A, the normalization, which is the very thing I set out to compute. Still, this is the optimal importance sampler. The same holds for the path integral. Here we have the same problem, and I draw trajectories from the uncontrolled dynamics, which is just a plain diffusion, and I weight the trajectories with e to the minus S. That gives me an estimate of psi, which is just this average, right? So we have psi as the integral over all trajectories of Q of tau times e to the minus S of tau, and what we compute here is a sample estimate of that: we replace it by roughly one over n times the sum over sampled trajectories of e to the minus S of tau i, right? These are the trajectories, and these are just the weights that we add up here. So now we can talk about sample efficiency. Some samples will have a large weight and some will have a low weight, and it is the variance of these weights that determines our efficiency. If the variance of the weights is very large, the sampler is very bad; if the variance is very small, it is a very good sampler. Here you see the example that we started off with: most of the weights are zero and a few are non-zero, so you get a very large variance in the weights, and that is why this is a very bad sampler. So you want a sampling procedure whose weights are as equal as possible, so that you sample as uniformly as possible; that gives you the best sampler. And the optimal sampler that I had here does exactly that, because each time it produces the same number, so the variance is zero. So we can define the variance of the weights in this way, and we can introduce something called the effective sample size, which says how many effective samples I have out of the n samples that I used. So if I start with 1,000 samples and, as in the previous example, only three are left, then the effective sample size is three. In general I can relate it to the variance in this way: if the variance is large, the effective sample size is small, right? In this case, the effective sample size is 1.8. I have here, I think, 10 trajectories; they are color coded, but you hardly see that because their color is almost white. These are the only two out of 10 that survive. If I now take another kind of controller, another distribution P generated by diffusions that already take a little bit of the intelligence into account: in this case I know the optimal control, so I could generate trajectories with the optimal control, and that would be a very good way to generate samples. Here, as an intermediate, I take half the optimal control: the optimal control solution as a function of x and t, divided by two, which gives a suboptimal controller. In this case, with 10 trajectories, you get effectively three and a half trajectories. And if you sample with the optimal control, the effective sample size is very, very high.
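As a small illustration of the effective sample size, here is one common estimator, ESS = (sum of weights)^2 / (sum of squared weights); the exact relation to the weight variance can be written in slightly different ways, so take this as a sketch rather than the lecture's precise definition.

```python
import numpy as np

def effective_sample_size(weights):
    """ESS = (sum w)^2 / sum(w^2): equal weights give ESS = N,
    a single dominant weight gives ESS close to 1."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

# Toy case: 10 trajectories, only two carry non-negligible weight,
# roughly the situation described above.
w = np.array([0.0] * 8 + [1.0, 0.25])
print(effective_sample_size(w))   # ~1.47, far fewer than 10 effective samples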
So you see here that there is a sort of bootstrapping procedure: a bad controller corresponds to a bad importance sampler. What I should say, and didn't yet, is that this importance sampling is done by changing the control, right? We have a diffusion process that naively generates from the uncontrolled dynamics, but now we do importance sampling by actually generating trajectories with a certain control. And by changing that control, we can build more and more effective importance sampling strategies. Naively, looking at this picture, you would already expect that if you sample in the direction of the optimal control, which is a solution that is quite good, your samples will be very good: many samples survive. So there is an alignment of two objectives. One is to compute the optimal control, which is to minimize the control cost. The other is that, in order to compute it, you need to sample, and you want to optimize the sampler; and to optimize the sampler, you also need a control. It turns out that these two things go hand in hand: we can prove a theorem saying that if one controller is better than another in the sense of optimal control, that is, it has lower control cost, then it is also a better importance sampler, in the sense of the variance of the weights and of the effective sample size. So the picture is: you start with a lousy controller, because you don't know anything, maybe control zero; you get some samples; from those you construct some sort of controller; with that controller you sample again and get better estimates, better particles, better statistics, and you compute a better controller; with the better controller you again generate samples, get a more refined set of samples, compute a better controller, et cetera. You can also show that if you sample with the optimal controller, that is, if you already have the optimal control solution in advance, then it is also optimal in the sense of sampling. And what comes out is the following: if you look at the formula here, if I sample the trajectories from an intermediate distribution, then I have to multiply the cost by this importance sampling ratio, the one I had before, to correct for sampling from the wrong distribution. Now this ratio can be absorbed into the exponent by changing S to S_u, and if you do that, S_u becomes S plus these two u-dependent terms. So this is the general formula for the cost that you use for the weighting. If you sample with control zero, you just get S, but if you sample with a non-zero control, you get these additional terms here. And now the miracle is this: the samples are weighted with e to the minus S_u, and we have just understood that if we sample from the optimal control solution, the variance of the weights is zero. What does that mean? It means that these numbers are always the same. They have no variance; they are always the same.
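Here is a minimal 1-D sketch of this controlled importance sampler, not from the lecture: it assumes unit noise intensity (dx = u dt + dW), a hypothetical state cost V, no end cost, and the Girsanov-style correction S_u = S + the half-u-squared term plus the u dW term; exact sign and scaling conventions depend on the noise level, so treat this as schematic.

```python
import numpy as np

def psi_estimate(u, V, x0, T, dt, n_samples, rng):
    """Estimate psi = E_q[exp(-S)] by sampling the controlled dynamics
    dx = u(x, t) dt + dW and weighting each path with exp(-S_u), where
    S_u = int V(x) dt + int (0.5 * u^2 dt + u dW).  Returns the estimate
    and the effective sample size of the weights."""
    n_steps = int(T / dt)
    log_w = np.zeros(n_samples)
    x = np.full(n_samples, x0)
    for k in range(n_steps):
        t = k * dt
        uu = u(x, t)
        dW = rng.normal(0.0, np.sqrt(dt), n_samples)
        log_w -= V(x) * dt                      # path-cost part of -S
        log_w -= 0.5 * uu**2 * dt + uu * dW     # the two u-dependent correction terms
        x = x + uu * dt + dW                    # controlled proposal dynamics
    w = np.exp(log_w)
    ess = w.sum()**2 / np.sum(w**2)
    return w.mean(), ess
```

With u = 0 this reduces to the naive uncontrolled sampler; the closer u is to the optimal control, the flatter the weights and the higher the effective sample size.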
Now these numbers are e to the minus S_u, so that means that S_u gives the same number for every trajectory that I take. That's miraculous. S_u is a stochastic quantity: there is noise here, there is noise in the x, there is noise all over the place. But if, and only if, I replace this function u by the optimal control solution, then miraculously all the noise cancels, and whatever trajectory I put in, I always get the same number, right? It becomes a deterministic function. It's a very curious fact, and it holds if and only if u is the optimal control. So this is the notion of importance sampling; you'll find more in this paper. Are we out of time? I still have 15 minutes? Okay. So here is also a proof that this S_u becomes deterministic; it's a very easy one-line proof. Remember that we can write the control cost as an expected cost plus this KL term, right? And this can also be written as the expectation of S_u, because S_u is S with this extra term, the log of P over Q, added in. So we can add it here as well: the control cost is the expectation of S_u. Now, the optimal cost-to-go is minus the log of psi, which is this term we had at the beginning for the optimal control. We can replace this Q by P times Q over P, put the Q over P into the exponent, and write it like this. So this is just the same thing, and here we recognize that this exponent is minus S_u. Now we use Jensen's inequality: minus the log of an expectation is less than or equal to the expectation of minus the log, so we can interchange these. We get the sum over tau of P of tau times the log of the exponent, which is just this term here, and that is C of P. So all in all, I have said nothing other than that the optimal cost is less than any cost, which is not a very deep statement. But you see that you have this Jensen bound in between, and the inequality becomes an equality only if this quantity becomes noiseless. Because if this quantity does not vary, you can take it out of the expectation value, and only then. So that is why, if S_u becomes noiseless, the bound becomes tight, and that means that your P is actually the optimal control. And that is a very brief way of seeing that if P is the distribution over optimally controlled trajectories, then the variance of S_u is zero. That follows from this proof, okay. Okay, so I told you that if you have a better controller, you can sample better. But now of course we need to find a better controller, and we have to parameterize it and learn it somehow. So this is what we can do. Here we have our estimation problem, and we know that the optimal importance sampler is this optimal control distribution: it is the integrand, renormalized, and the normalization constant is intractable. This is the same as what I showed before for the very simple Gaussian example.
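For reference, here is a hedged reconstruction of that one-line Jensen argument in the lecture's notation, with Q the uncontrolled path distribution, P the controlled one, S_u(tau) = S(tau) + log(P(tau)/Q(tau)), and C(P) = E_P[S_u] the control cost:

```latex
\begin{aligned}
-\log \psi
  &= -\log \sum_\tau Q(\tau)\, e^{-S(\tau)}
   = -\log \sum_\tau P(\tau)\,\frac{Q(\tau)}{P(\tau)}\, e^{-S(\tau)} \\
  &= -\log \mathbb{E}_{P}\!\left[e^{-S_u(\tau)}\right]
  \;\le\; \mathbb{E}_{P}\!\left[S_u(\tau)\right] \;=\; C(P).
\end{aligned}
```

Jensen's inequality is tight exactly when the quantity inside the expectation is constant, so the bound becomes an equality, and C(P) reaches the optimal cost minus log psi, if and only if S_u is the same for (almost) every trajectory under P.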
So now we would like to sample from this, but what we are going to do instead is approximate this optimal distribution by a parameterized distribution, parameterized by a controller, right? Because it is a distribution over trajectories, and these trajectories are generated by a controller. In addition, this controller is itself parameterized by some parameters, because we cannot specify the controller completely. So we have this situation. One way to proceed is the so-called cross-entropy method, where you minimize the KL divergence between P-star, which is what you want, and P of u, which is what you have. And this is the result. It looks very strange, because you would think: I don't know P-star, so how can I ever minimize this distance? But it turns out that you can compute the gradient quite well. The gradient is given by an expectation over trajectories under your current control, weighted by e to the minus S_u, of a term integrated over time: your noise times the gradient of the controller. So it's a somewhat complex term. These gradients can be estimated by sampling, and then we change the parameters to learn a better and better sampler, and at the same time a better and better control solution; we are doing both at once. This can be parallelized very easily, in the following way. Suppose I have a number of iterations. I have a certain controller, I have a certain model, and I generate data, which is trajectories over the finite horizon time. So this is the importance sampling that I do. Then, with the sampled data, I learn a new controller, which is basically a gradient descent step, right? I do this gradient update, I get a better controller, and with the new controller I can generate data again. The interesting thing is how well this parallelizes. The data generation, say generating 100 samples, can be split over 10 machines, each generating 10 samples, right? Those are these MCMC boxes, which run completely in parallel. Then there is the gradient computation. The gradient also involves an expectation over samples, so it can also be parallelized, like mini-batches: the 100 samples form 10 mini-batches, and each mini-batch computes its contribution to the gradient, right? Then you add these contributions to get the new total gradient, which you feed back, and you update all the controllers. They generate new data again, new samples again. So this can be highly parallelized and highly optimized. We applied this first to some simple problems, just to give you an idea. This is an inverted pendulum, which has to swing up, with a cost that is essentially to be on top. Here you see the effective sample size, on a scale from zero to one, where one is the maximum: starting near zero, the quality of the samples that you generate gets quite high at some point.
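Here is a schematic sketch of one such cross-entropy update, in the spirit of the loop just described; it is not the lecture's exact algorithm. It assumes the 1-D unit-noise setting from the earlier sketch, hypothetical functions u(theta, x, t) and grad_u(theta, x, t) (the controller and its parameter gradient), and a hypothetical state cost V; the sign and scaling of the gradient depend on conventions, so treat it as illustrative.

```python
import numpy as np

def cross_entropy_step(theta, u, grad_u, V, x0, T, dt, n_samples, lr, rng):
    """One sketched update: roll out trajectories under the current controller,
    accumulate the weights exp(-S_u) and the per-trajectory noise integral
    int grad_theta(u) dW, then take a weighted gradient step on theta."""
    n_steps = int(T / dt)
    log_w = np.zeros(n_samples)
    g = np.zeros((n_samples,) + theta.shape)   # per-trajectory gradient terms
    x = np.full(n_samples, x0)
    for k in range(n_steps):
        t = k * dt
        uu = u(theta, x, t)
        dW = rng.normal(0.0, np.sqrt(dt), n_samples)
        log_w -= V(x) * dt + 0.5 * uu**2 * dt + uu * dW   # accumulate -S_u
        g += grad_u(theta, x, t) * dW[:, None]            # int grad_theta(u) dW
        x = x + uu * dt + dW
    w = np.exp(log_w - log_w.max())                       # stabilized weights
    w /= w.sum()
    grad = -(w[:, None] * g).sum(axis=0)                  # KL gradient estimate
    return theta - lr * grad
```

Because the rollouts and the per-mini-batch gradient terms are independent, both loops can be farmed out to parallel workers exactly as described above, with only the summed gradient communicated back.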
You see the cost going down, and here is a 2D rendering of the solution, where this axis is velocity and this axis is position between zero and two pi; here is the initial position, hanging straight down, and then you swing up to the top position, which is the controller you learn for this problem. Here is another example, the Acrobot, a second-order, two-degree-of-freedom robot, which you see here. We learned this system as well, and you see that after learning it works quite well; sometimes it doesn't work. And here are some details, which I'm not going to go into; we're running out of time. So let me round up, and then we have time for some questions.
I think this whole line of work can be used in the context of integrating sensorimotor control. You could imagine a loop where you initialize with an initial control, and each time you act in the world with your controller, you get some data. With that data from the outside world and your current controller, you can fit a model; a model here means the function f, say, or the function g, in your dynamical system. You estimate these models, and with these models you then compute the optimal control in that world. This optimal control computation itself involves another kind of data generation, namely the Monte Carlo sampling that produces your trajectories in the path integral, and then you update your controller by the parameter estimation, as we saw, by changing this theta parameter. And, oh, I didn't mention: there is no particular model assumption in this framework for the controller. You can use very simple representations, like the grid used in this case, but you can also use a deep neural network. Since deep neural networks are very flexible, and since in principle you can generate an infinite amount of data, there is actually no obstruction to getting as close as you like to the optimal controller, right? You are in the happy situation that if you don't have enough data, you just generate more, and if your model is not strong enough, you just make a stronger model. But of course there is a catch with the bootstrapping: initially you have a very lousy controller, so you get very lousy data, and with very lousy data you cannot learn a very refined model. So initially you can perhaps only learn a very simple model, and you have to cascade your model complexity: first learn a simple model, then a more complex one, and in this way bootstrap the situation. Anyway, to come back to the beginning of yesterday: there are two types of data, right? There is the data that you get from acting in the world, and in this view there is also the data that you generate from your own model. These are the two realities of the brain that I was talking about yesterday. The first kind is the sensory data that you process from your Bayesian viewpoint, and the second kind, the data you generate in your Monte Carlo importance sampling, is the data you generate to compute your optimal control. So that was the idea. That is all I wanted to say; thank you very much for your patience and attention. Thank you.