In robotics, supervised learning is not enough. One thing you will only learn about tomorrow is learning from demonstrations, that is, imitation learning. I quickly indicated that there are two different ways to do it: one is the supervised learning way, the other is the inverse reinforcement learning way; that is what we are going to talk about tomorrow. And clearly, no demonstration is ever perfect. Remember the first time your tennis teacher took you by the hand, showed you "this is a forehand", and how many trials it took until you got the first ball over the net. If you are as incapable at motor skills as I am, it takes a lot of very painful hours. Quite clearly, demonstrations are never, or at best very rarely, good enough once they get transferred to a different body, even if they are given by kinesthetic teaching. The second part is that we simply cannot demonstrate everything. Try to imagine learning everything by imitation: parenting would become a pure nightmare, since you would have to teach every single behavior the kid needs, and by the time it had all of them it would probably be 18 before it enters primary school. You simply cannot make this a full-time job for a single child. For that reason we really do need reinforcement learning. I know you have had a couple of lectures on it, and I saw that you had a very nice tutorial yesterday, so there will be some redundancy here, but I will take a robotics perspective. We are coming from control, and in some cases in robotics we actually understand reinforcement learning better than the people who come from, how to put this, grid worlds and tabular representations, since in control we have been doing this for real applications much, much longer than people in the reinforcement learning world. So some aspects will simply get a very different angle. Anyway, you will have the problem that your robot has to explore, much like this R2-D2, and we sometimes call mobile robots, not entirely kindly, mobile trash cans; in this case the mobile trash can really gets a bad reward for exploring the wrong option. So robot learning systems need self-improvement: we need to explore by trial and error, and most importantly we require evaluative feedback. Whether you phrase that as reward or cost depends on whether you are an optimist or a pessimist. Classic controls people are pessimists, they minimize costs; reinforcement learning people are optimists, they maximize rewards. But that is actually the same thing: add a little minus sign and you notice it is not that different after all. What you will see today is how we can derive actions that maximize the long-term reward. For that, let's start with optimal control from learned models, where we go the route of creating a data set, learning a model just like we did yesterday, a forward model in this case, then obtaining what you already saw yesterday, a value function, from that obtaining a policy, and going all the way back to data generation. Then we are going to do one thing: we are going to get rid of the model in between. This has severe implications, but obviously there is a good reason to do it.
I already told you yesterday about my experience of destroying my robot in my first week as a group leader: I had learned a forward model which was really, really good, except for one tiny region where it was slightly wrong, and my reinforcement learning agent learned, oh, I can exploit this one region, and unfortunately gave me a truly terrible solution. That is one of the reasons why people liked the idea of value function methods. The big problem with value function methods, as you will see, is that in the end you have to fill up the state space with samples. That is easy when you are living in a grid-world state space which is really high dimensional but where only very few discrete states actually matter, as you have probably seen with the Atari games, which by now everybody has seen. That is not always the case in robotics: we often have continuous state-action spaces which you would really have to fill up with data, and there these value function methods also fail tremendously. For that reason you will see a third path, the path of policy search, where we try to approximate a policy directly from data, using the rewards, or returns, obviously. So these are the three different roads we will follow, and hopefully, one by one, you get a slightly different angle, a slightly different perspective on reinforcement learning, but from a core robotics point of view. Questions? Yes, sir. "Can you say more about only very few points in the state space actually mattering?" There was a student of Rich Sutton who created a shallow representation of the state space that actually mattered in the Atari games, and it turned out to be extremely sparse which of the cells were actually turned on and off. For that reason she could maintain the active cells just as adjacency lists instead of as big arrays and still find them really fast. So instead of having, say, 50^D table entries, which would be gigantically big, in the end you needed something that is orders of magnitude smaller in the tabular representation, and that is why it was still possible to use these tabular methods. So, now let's start with optimal control on learned models, and let's start with a problem you all have to solve once in a while; sadly my blob here has drifted a bit, we are really closer to up here. Let's say you have won the best paper award and you want an optimal policy to go collect it. We all like such problems, right? Of course we have a network of flights, and if you are not like me, with my talent for missing flights, everything will go well, and there are costs for these flights. You have a final reward of a thousand dollars, and I have made up some arbitrary flight prices that are definitely not anchored in any form of reality, and you now want to figure out: how do I maximize my return? Every one of you learned in computer science 101 one graph algorithm which solves exactly this problem. Dijkstra, very good.
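To make the flight example concrete, here is a minimal sketch of that backward dynamic-programming sweep in Python. The cities, prices and the thousand-dollar final reward are made up for illustration; they are not the numbers on the slide:

```python
# Minimal backward dynamic programming on a made-up flight network.
# States are cities, actions are flights with costs, and on the final
# day only Madrid pays out the best-paper reward.
flights = {  # city -> list of (next_city, flight_cost)
    "Berlin":  [("Trieste", 100), ("Paris", 400)],
    "Paris":   [("Madrid", 150), ("London", 80)],
    "London":  [("Madrid", 250)],
    "Trieste": [("Rome", 50), ("Madrid", 300)],
    "Rome":    [("Madrid", 120)],
    "Madrid":  [],
}
HORIZON = 4
FINAL_REWARD = {city: 0.0 for city in flights}
FINAL_REWARD["Madrid"] = 1000.0

# V[city] = best achievable return from `city` with the remaining days.
V = dict(FINAL_REWARD)
policy = {}
for t in reversed(range(HORIZON)):
    new_V, policy[t] = {}, {}
    for city, options in flights.items():
        # Staying put is always allowed (no cost, no reward).
        best_value, best_action = V[city], "stay"
        for next_city, cost in options:
            value = -cost + V[next_city]
            if value > best_value:
                best_value, best_action = value, next_city
        new_V[city], policy[t][city] = best_value, best_action
    V = new_V

print(V)          # optimal returns per starting city
print(policy[0])  # where to fly on the first day
```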
Dijkstra was a fine Dutch computer scientist, but already before Dijkstra people had started to figure out the underlying principle, and for us in optimal control it starts with a fine gentleman called Richard Bellman. That is Richard Bellman, sitting at the University of Southern California in the computing center, and he came up with the insight of dynamic programming. The core insight behind dynamic programming he called the principle of optimality. It is important to recognize that he never talked about a Bellman equation; that is something we in reinforcement learning attribute to him. Bellman never claimed the Bellman equation. In fact, he would have attributed it to Poisson, who had what we in reinforcement learning call the Bellman equation long before. What Bellman actually claimed is something more powerful: the principle of optimality, which is basically what we call the Bellman equation in reinforcement learning plus the max operator. That is quite a crucial difference, because the equation without the max you can also obtain from a linear programming point of view, in which case you do not (or at least not always) get a policy out of it, and you have to solve a very high dimensional linear programming problem. What Bellman created instead is a general principle: an optimal sequence of controls in a multi-stage optimization problem has the property that, whatever the initial state and initial controls are, the remaining controls must constitute an optimal sequence of decisions for the remaining problem, with the stage and state resulting from the previous controls considered as initial conditions. Put very simply: if you know what is optimal from tomorrow onwards, you can parcel the problem up. Today you do one maximization step over all possible futures, then you go backwards, and backwards, and backwards, and you never need combined forward and backward passes. That is really the core difference to the optimal control literature that existed before. Optimal control is actually 250 years old: long before anybody could build such controllers, people used variational calculus to derive optimal control laws in mechanics, and there you always have to solve forward and backward passes along a differential equation. The Bellman principle, and its generalization, the Hamilton-Jacobi-Bellman equation, allow you to just go backwards in time, and that is what gives you efficient solutions. This was really quite a key breakthrough. I should add, and I still find this funny, that the name "dynamic programming" which he gave to the resulting methods has become widespread, but very few of us actually know the story behind it. Does any of you? You do, okay. No? Okay, only one of you knows it, so I will tell the story; it is actually pretty cool. You can find it in Bellman's autobiography, Eye of the Hurricane. Back then, in the 1950s, the United States already had a "science minister" with the temperament of Donald Trump and the science affinity of Donald Trump's voters.
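In symbols (my notation, not Bellman's), the difference he is pointing at is just the max operator:

```latex
% Evaluation form (no max): linear in V, solvable e.g. as a large linear program
V^{\pi}_t(s) \;=\; r\big(s,\pi_t(s)\big) \;+\; \mathbb{E}\!\left[\,V^{\pi}_{t+1}(s') \;\middle|\; s,\pi_t(s)\right]
% Principle of optimality: the same recursion with the max over actions added
V^{*}_t(s) \;=\; \max_{a}\;\Big[\, r(s,a) \;+\; \mathbb{E}\!\left[\,V^{*}_{t+1}(s') \;\middle|\; s,a\right] \Big]
```

The first line is linear in the value function; the second line, with the max, is the principle of optimality and directly yields a policy via the maximizing action.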
So he was basically extremely opposed to anything scientific or mathematical. Bellman, at that time, was switching from being a number theorist to becoming a controls person, and being a mathematician by training, he wanted to create the mathematics of an optimal, scientific approach to control. He was then told: you know, for this funding program the decision will go all the way up to the minister (secretaries, they call them in the United States), and this guy turns completely red when he hears the word mathematics or science. So Bellman was basically told: do something smarter. And Bellman, who had been through the Second World War, realized: well, this guy likes the military, and the military always called optimization problems "programming". That is actually where the word programming comes from: in optimization you would solve a problem by looking things up in certain tables, this was called programming, and that is how you knew where to shoot a gun; it was a common thing in the Second World War, before we had computers. So he said: okay, I call it programming so that the guy thinks about the military, and I call it dynamic so that it sounds much more modern. That is the whole story. In other words, you have learned one thing: how to choose good names, because that one really stuck, right? Now let's do what one would do with Dijkstra's algorithm, or actually with dynamic programming: you always start on the final day, the day we die. And obviously, on the day we die, if we are in Madrid, we go to the conference, pick up our best paper award, enjoy the fame, spend the thousand bucks immediately and have a good time. But if you are down here at the Atria, you would much rather hang out at the summer school: you will not get a financial reward for it, but it will be a good day. And similarly, if you are in Paris you may want to look at the Eiffel Tower, in London at, I don't know, the Tower, and in Berlin at the Brandenburg Gate, but in those cases we do not get any additional reward.
Now we go backwards in time, and in this case we can take flights. For most places it now makes complete sense to fly towards Madrid, except if you are already in Madrid, in which case you want to stay in Madrid, and except if you are in Berlin, because with only a single day left you will not reach it: there is no direct flight. Now we can go a step further back, and if you do, you see something interesting on the slide: this route is actually cheaper than that one, so quite clearly, if you are in Berlin, you should fly through Trieste and then on to Madrid, while the other choices stay the way they were. Now we just keep going backwards in time, and finally we have converged to an optimal policy. Beautiful. Now, this was realized in the 1950s, and in reinforcement learning we have given it a language. You should be a little careful with this language, because in reinforcement learning we normally focus on the stationary case. From a controls perspective that is actually a pretty bad habit, because the stationary case is only a very special case: nearly all the interesting things happening in our world do not happen in the stationary case, they happen on the way to it, on the way to the stationary distribution, for example, and much of our world should for that reason probably not even be modeled that way. But for our problem it is fully enough. So we now have a state space, each of the cities becomes a state, and each of the flights becomes an action. We have been very lucky so far, none of our flights got hijacked and we did not end up in Cuba, so we can assume deterministic transition dynamics, unless you have, I don't know, canceled flights and delays, in which case it becomes more annoying. Then there is the reward function, which in our case is just the costs associated with the states and the flights, so a function of state and action, and we have some initial state probabilities which tell us where we are going to start. Now, one assumption we nearly always make, across all fields, is the Markov property, which basically means that the sufficient statistics of the next state depend only on the current state and the current action. You need to recognize that this can nearly always be made true. Take a mechanical system: you cannot jump in position, unless you are in quantum physics, and you cannot jump in velocity, again unless you are in quantum physics, so position and velocity together form a state. But if you discretized only the position, you would directly recognize that something is going wrong, until you notice that instead of using the second-order equation of mechanics, where the acceleration is the inverse inertia times the torque minus all the other force terms, we need to bring it into first-order form when we discretize: one part is a function of q, one part is a function of q and q-dot, the torque is our action, and what appears on the left-hand side, once we add a tiny delta t and remove one dot, is our next state. These quantities at the next time step are s_{t+1}, while position and velocity at the current time step are s_t.
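Written out, with a generic manipulator equation standing in for the blackboard sketch (my reconstruction, not a literal transcription), the trick looks like this:

```latex
% Second-order mechanics, with the torque u as the action:
M(q)\,\ddot{q} \;=\; u - c(q,\dot{q}) - g(q)
% Discretize and stack position and velocity into one Markov state s_t = (q_t, \dot{q}_t):
\begin{aligned}
q_{t+1}       &= q_t + \delta t\,\dot{q}_t \\
\dot{q}_{t+1} &= \dot{q}_t + \delta t\, M(q_t)^{-1}\big(u_t - c(q_t,\dot{q}_t) - g(q_t)\big)
\end{aligned}
% so s_{t+1} depends only on s_t and u_t, and the Markov property holds.
```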
So a second-order differential equation can be transformed into a first-order differential equation, and we can do the same with nearly all differential equations, except for the ones which would grow in dimensionality, and that is where we would genuinely become non-Markovian. For a robot, where we are always talking mechanics, always second-order equations, we can safely get the Markov property by lumping position and velocity together into our state. Oops, there is a plus missing here, so let's say I am using C++ notation. Okay, questions? Let's continue. Now, reinforcement learning obviously lives in the loop of taking actions, getting next states, getting rewards, and continuing. I always write the policies as parameterized policies, even though initially we do not really need that. And what we want to do, of course, is maximize the expected long-term reward with a discount factor gamma. It is a good question whether you should put a one-minus-gamma in front, which mathematically would be more appropriate, but it is just a constant factor, so it does not matter for the optimization. That is the objective we have. If you now take the algorithm we have been discussing so far: we start on the day we die, take the final reward and initialize our table of values with it, and then we go backwards in time. At each step we first create a table of all our actions in terms of their quality, always taking the expectation over the future and adding the current reward, and then we pick the optimal action in order to get the previous table entry. Once we have gone backwards once, we can, for every state in the table and every time step, read off the optimal policy. Straightforward, and this is exactly an application of the Bellman principle of optimality: a maximum backward recursion. When you iterate this backwards long enough you get V, and you could also do the same backward recursion in terms of Q, the state-action value function, if you like. In the picture here you see the values of different states: there is a really good goal state here, here is where the agent currently sits, and here are some puddles where you get wet, so negative reward, and there is one big positive reward. You see the kind of potential field that emerges; that is the value function. Am I going too slow? About right, okay, but tell me if I have to speed up or slow down. So this falls directly out of the recursion. And then comes the sad part: in some cases the max operator over actions becomes much more expensive than just evaluating states. In that case you do not want to follow that route; you want to decompose the computation into two steps, a policy evaluation step, which is obviously much cheaper in that case, and a policy improvement step. The policy evaluation step estimates the quality of the states under the current policy, while the policy improvement step just applies the max operator. Together this is called policy iteration, and it really decomposes the backward recursion we had before.
Before, the value of a state at time t was the maximum over all actions: V_t(s) = max_a [ r(s,a) + gamma E[ V_{t+1}(s') | s, a ] ], where the expectation is over the next state s'. We decompose this into two equations. One is policy evaluation, where we insert the policy instead of maximizing: V^pi_t(s) = r(s, pi_t(s)) + gamma E[ V^pi_{t+1}(s') | s, a = pi_t(s) ]. Importantly, by the way, notice that you need to index these quantities with t if you want to do this properly. And then comes policy improvement, pi_t(s) = argmax over a of exactly the quantity in the brackets in front. Importantly, you can perform these two steps separately, and the costly step, the one with the max operator, has to be performed much less frequently than the somewhat cheaper evaluation step, at least in big tabular problems. The reinforcement learning theoreticians call it a big mystery why this loop converges so incredibly fast; as a controls guy I see something very intuitive in it. In value iteration you effectively sweep backwards over the whole future, basically the infinite-horizon future, one maximization at every step, whereas policy iteration you can see as a branching process that contracts very quickly, so intuitively it should contract exponentially and you should only need logarithmically many policy iteration steps in comparison to the number of value iteration backups you would need going backwards. So, intuitively, logarithmically many steps of the complete loop here, versus something linear for value iteration. Now, tables are all nice, but we all know tables are not the solution, at least not for most things which are interesting in robotics. Luckily we have, from the control literature, another problem which is much, much more practical for us, and that is the linear quadratic regulator. There the state is a big vector, the action is a big vector, and we can make everything time dependent: we have a linear state transition, consisting of a linear transformation of the state, a linear transformation of the action, a drift term, and we can have some Gaussian noise. Now, many people make the mistake of thinking that the noise has to be Gaussian for the LQR controller. That is not the case: it only has to be unbiased, with this mean and this variance; you can have higher-order moments as long as there are no asymmetric components. For example, if you used a triangle distribution or a logistic distribution, LQR would still work. The Gaussian assumption only becomes necessary when you move from the MDP to the POMDP, where you no longer observe the complete state and have to put in a filter, since in that case the Gaussianness of the state estimator, when you put in a Kalman filter, has to be compatible with the Gaussianness of the model in the LQR controller, and then you really do need Gaussian noise. It is a really important distinction which most people do not actually know, but Gaussian noise is, for most of our cases, the most practical case anyway. Then we assume a reward function which is quadratic around some offset: a second-order polynomial of the state plus a second-order polynomial of the action.
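Going back to the tabular case for a second, here is a minimal sketch of that evaluation/improvement loop, in the stationary discounted form for simplicity, with made-up transition and reward tables:

```python
import numpy as np

# Tabular policy iteration for a generic finite MDP (stationary, discounted
# form for simplicity). P[s, a, s'] are transition probabilities, R[s, a] rewards.
def policy_iteration(P, R, gamma=0.95):
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi V (the cheaper step).
        P_pi = P[np.arange(n_states), policy]          # (S, S)
        R_pi = R[np.arange(n_states), policy]          # (S,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: one max operator over the Q-values (the costly step).
        Q = R + gamma * P @ V                          # (S, A)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return V, policy
        policy = new_policy

# Tiny made-up example: 2 states, 2 actions.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
V, pi = policy_iteration(P, R)
print(V, pi)
```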
For convenience we assume the same quadratic form for the final state, and we also assume a Gaussian start-state distribution. Now we do the simplest thing possible: we estimate this model, and subsequently we compute the optimal control law for it. For that I should not have drawn so much on this blackboard, since I want to do the simplest case for you on the blackboard. God, I'm losing listeners quickly. So let's do the simplest case possible. Say we have a reward function r(x, u) = x'Qx + u'Ru, where I have put the minuses into Q and R, and let's take the simplest possible next state: we could even do it without noise, but a little bit of simple noise is totally fine, so x_{t+1} = Ax + Bu + epsilon. Now, what is the first step? The first step is that we look at the day we die. The value function on the final day, at time T, is obviously just the maximum over all our actions of the reward function. Can one of you tell me the maximum of this reward function if I do not have to consider anything else? Hint: it is really easy. Ah, come on guys, my undergrads are more responsive than you are. Yes, please, who can take the derivative of this function with respect to u and set it to zero, quickly, in their head or on a piece of paper? I am not going to let you go that easily. It is a quadratic function, a bowl, very easy, and you can even separate the dimensions since there is no cross term. Is it too easy for you and you just do not dare to say it? Okay, I shall randomly pick people. I can't believe this, my undergrads are more responsive. Okay, I'll draw this thing, maybe then it is clearer. Can somebody now tell me the maximum with respect to u? Basic math, please. Lorenzo, how did you make them talk? Oh, you didn't try, right. Okay. Guys, when you have a bowl and you look for the maximum with respect to u, the maximum obviously sits right here, at u = 0. So the value here is just x'Qx. And how did we find it? We took the derivative of r with respect to u, which gives us Ru, we set it to zero, which yields u = 0, and this leaves behind only the state term. This matrix we will now rename: we call it P_T, a big matrix P_T. Then we do the same step again, just with one day less to live; for the heck of it I call this day t. So V_t(x) is again the max over u of r(x, u) plus, and because you like discounting I will put in a discount factor gamma even though you do not actually need it here, the expectation of the next state of this linear system, Ax + Bu + epsilon, pushed through P_{t+1}. This initially looks really difficult, right? You would have to open your matrix cookbook or something. But plug it all in: you have x'Qx + u'Ru, I should keep the max operator in front just in case, plus gamma times the expectation, and now we recognize that we can use the binomial formulas to split off the epsilon. We get the purely deterministic term, (Ax + Bu)' P_{t+1} (Ax + Bu), which collects all the deterministic parts, and then two other terms: two times epsilon' P_{t+1} (Ax + Bu), and epsilon' P_{t+1} epsilon. The cross term is zero in expectation, because we have assumed white, zero-mean Gaussian noise, so that whole term drops out.
The last term, epsilon' P_{t+1} epsilon, is somewhat more tricky: you first have to rewrite it as the expectation of a trace, trace(P_{t+1} epsilon epsilon'), and after a few tricks of pulling things out of the expectation you see it is trace(P_{t+1} Sigma_epsilon), which is, funnily enough, just a constant with respect to your actions, so we do not need to carry it around. So it is only the remaining components which actually matter. This allows us to take the derivative of this quantity, which I will call, even though it slightly overloads notation, Q again; maybe I put a bar on it, so let's call it Q-bar(x, u). When you now take the derivative of Q-bar with respect to u, what do you get? You again get the Ru term, plus everything which comes out of these products, and this is where it gets interesting, because when you look at the binomial expansion, the x-with-x product does not matter, the u-with-u product obviously matters, and the cross product also matters. So we get gamma B' P_{t+1} B u from the one product, plus the cross term, 2 gamma B' P_{t+1} A x. Setting this to zero and bringing everything onto one side gives us u = minus (R + gamma B' P_{t+1} B)^{-1} gamma B' P_{t+1} A x. And yes, I have been sloppy with the factors: all of these terms carry a 2, I should have put a one-half in front somewhere to make it nicer, but the 2s cancel in the end anyway. Now this has something really, really cool in it: the whole thing in front of x is just a constant matrix, we can call it K_t, and it is what we call in control a gain. You have an equivalent of it in gradient descent: in control the gain can be a big matrix, in gradient descent you just have a scalar gain. Anyone want to guess what it is? The learning rate. So here we have an optimal learning rate. And now we insert this back into the upper equation, so V_t(x) = Q-bar_t(x, u*). Can you guys still see this or should I move the chairs? Yes? Okay, fine, I'll go for this corner. We put in that V_t(x) equals Q-bar_t(x, u*), where we add the star because we like stars, and where u* we can directly replace by K_t x. This gives us a beautiful equation: V_t(x) = x' ( Q + K_t' R K_t + gamma (A + B K_t)' P_{t+1} (A + B K_t) ) x, where for the expectation we only have to care about the mean, since the noise only contributed that constant trace term. This should feel magical to you. It doesn't? It should, because when you look at it, the thing in the big parentheses is again just a matrix, and what we have here is what we in control call the Riccati equation, just written without the x's around it. Beautiful, huh? We have just solved a Markov decision process which is not a table. Now comes the sad part: this is pretty much the only one we can solve which is not a table.
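Collected in one place, here is my reconstruction of the blackboard derivation, with the factors of two carried along and the minuses folded into Q and R as in the lecture:

```latex
\begin{aligned}
&\text{Model: } x_{t+1} = A x_t + B u_t + \epsilon,\quad
  \mathbb{E}[\epsilon]=0,\ \operatorname{Cov}[\epsilon]=\Sigma_\epsilon,
  \qquad r(x,u) = x^\top Q x + u^\top R u \ \ (Q, R \prec 0)\\[2pt]
&V_T(x) = \max_u\, r(x,u) = x^\top Q x \;\Rightarrow\; P_T = Q\\[2pt]
&V_t(x) = \max_u \Big[\, x^\top Q x + u^\top R u
   + \gamma\, \mathbb{E}\big[(Ax+Bu+\epsilon)^\top P_{t+1} (Ax+Bu+\epsilon)\big] \Big]\\[2pt]
&\tfrac{\partial}{\partial u}: \;\; 2Ru + 2\gamma B^\top P_{t+1}(Ax+Bu) = 0
 \;\Rightarrow\; u^*_t = K_t x,\qquad
 K_t = -\big(R+\gamma B^\top P_{t+1} B\big)^{-1}\gamma B^\top P_{t+1} A\\[2pt]
&P_t = Q + K_t^\top R K_t + \gamma (A+BK_t)^\top P_{t+1}(A+BK_t)
 \quad\text{(plus a $u$-independent constant } \gamma\operatorname{tr}(P_{t+1}\Sigma_\epsilon)\text{ in the value).}
\end{aligned}
```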
I mean, yes, we can take different noise models than Gaussian, but that is pretty much it; and you will see a little later that we can obviously also linearize. This, by the way, is called the Riccati equation because in differential equations it was apparently already discovered by an Italian gentleman some 200 years before we discovered optimal control, just in a somewhat different form. So, very nice, and we get one huge advantage with this: it scales really, really well. Let me show you a problem which you could not solve with a discrete solver, at least not at that time. What we basically did are the steps of taking a data set, identifying from the data set the A, B, Q and R, so this is model learning, and from that we get P_t and K*_t, the optimal gains. Okay, fine, we first get the P_t, which is the value function, and from that we get the K_t, which is the policy. Questions? Too fast, too easy, you have seen this before? In that case I am happy. So what you basically now know is how to compute, for this particular case, the optimal controller, and let's do that, and let's hope I can find my mouse cursor. Here it is. This is the first trial, and the system of course loses the pole. You should directly recognize that this pole is of course not a linear system, but we can treat it as one as long as we only use the samples from the first couple of seconds. And you notice it becomes better, because the better the model is, the better the gains and controls we can compute, and after about seven trials we can actually balance the pole. Now compare this to grid worlds, where you would have a pole in 2D and people do thousands of trials or more in classical grid-based reinforcement learning; on a real robot you could not even get that to run. Here the actions are based on the complete arm, so they are seven-dimensional actions, and the state includes both the pole and the arm, since the arm is not supposed to run away. The arm is a mechanical system, so it contributes 14 dimensions, and in addition there is the pole, which, because two points are being tracked and treated like linear systems, adds several more dimensions to your state space, so we are talking roughly an 18-dimensional state space and a seven-dimensional action space. There is no way you could discretize that. Question? No, it's okay: we are learning a model, the forward model of these two points, the red and the yellow one (that is Stefan Schaal's video, isn't it), and with this model they were predicting both the next state of the arm given the torques and the next state of these two points, meaning the pole, given the arm. Each point gives you a 3D Cartesian position and a 3D Cartesian velocity, and I see I even miscalculated a moment ago: that is a six-dimensional state from the pole alone; together with the seven-degrees-of-freedom arm, which gives you seven-dimensional actions and 14 state dimensions, it is a 20-dimensional problem. There is no way you can do this with a grid world, so that is fairly beautiful. Now, does this mean we can solve everything in robotics we cannot do with grid worlds by doing LQR? No, right? Why not? Anybody want to tell me? I think I have to tell you myself.
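Here is a minimal sketch of that pipeline: fit A and B from transition data by least squares, then run the Riccati recursion backwards for the gains. All data and matrices below are made up; this only illustrates the idea, it is not the code behind the demo:

```python
import numpy as np

def fit_linear_model(X, U, X_next):
    """Least-squares fit of x' ~= A x + B u from observed transitions."""
    Z = np.hstack([X, U])                        # (N, dim_x + dim_u)
    W, *_ = np.linalg.lstsq(Z, X_next, rcond=None)
    dim_x = X.shape[1]
    A, B = W[:dim_x].T, W[dim_x:].T
    return A, B

def lqr_backward(A, B, Q, R, horizon, gamma=1.0):
    """Finite-horizon Riccati recursion; Q, R negative definite (reward form)."""
    P, gains = Q, []
    for _ in range(horizon):
        K = -np.linalg.solve(R + gamma * B.T @ P @ B, gamma * B.T @ P @ A)
        P = Q + K.T @ R @ K + gamma * (A + B @ K).T @ P @ (A + B @ K)
        gains.append(K)
    return list(reversed(gains))                 # gains[t] = K_t

# Made-up transition data from some unknown 2D system with a 1D action.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)); U = rng.normal(size=(200, 1))
A_true = np.array([[1.0, 0.1], [0.0, 1.0]]); B_true = np.array([[0.0], [0.1]])
X_next = X @ A_true.T + U @ B_true.T + 0.01 * rng.normal(size=X.shape)

A, B = fit_linear_model(X, U, X_next)
Q, R = -np.eye(2), -0.1 * np.eye(1)              # minuses folded into the reward
gains = lqr_backward(A, B, Q, R, horizon=50)
print(gains[0])                                  # feedback gain for the first time step
```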
The world is not linear. Look at the inverted pendulum again, now the complete one: when you live in a small region around the top, and this time it is just a pendulum, one degree of freedom, the world is linear; but if you want to do an optimal swing-up with limited torque, you have to pump energy in to bring it up, and that looks very, very different. This is the value function you see here, and it has this band with a jump; there is no way these rewards are quadratic, and the system you can also only linearize at the top, around the equilibrium point. So quite obviously LQR only suffices when we want to control along a longer trajectory, in which case we can use time-dependent models, or when we want to stay around an equilibrium point, where we have a constant linear model. Reality is highly nonlinear, and definitely not quadratic, but nevertheless we can play a second trick. For the pole balancing we played the trick of learning a model just around the equilibrium point; you can also learn time-dependent models, and this is what Chris did back in the 1990s, when he learned time-dependent models of, in this case, just this 1D swing-up, collected from these trajectories, and computed optimal trajectories for different starting points. In that case you can always compute a locally optimal solution, but it really takes you away from thinking about policies, towards thinking about a policy as a collection of trajectories. In classical control this trajectory-centric thinking is these days again becoming commonplace, and most of the methods people are developing there are very trajectory-centric and not very policy-centric. So how would we do this? We use the simple insight that we can take any function and, using a Taylor approximation, approximate it either by a linear function or by a linear-plus-quadratic function. So we take our system and linearize it to get the forward equations, and we take our rewards and approximate them by a quadratic function. This brings us completely back to the LQR case, but it is not so trivial this time, because the model changes every time I go along a different trajectory; in other words, I have a different time-dependent forward model, which I can compute from a machine learning approximation. In this case I need to compute the optimal gains using a backward pass, then run a forward propagation through the trajectory to the end, obtain the new linearization, go back, and keep iterating until I have an optimal trajectory. The person who came up with LQR, I should quickly say, was actually Rudolf Emil Kalman, who also saw the duality to the Kalman filter. Interesting anecdote: Bellman had supposedly put a PhD student in charge of doing this as well, and the PhD student, while obviously not in the same league as Kalman, did not manage it, and from what I understand Bellman, who was quite a hot-headed guy, got really angry and supposedly threw him out of a window. So better never anger your advisor, an important side lesson. I hope for the PhD student it was not too high up, but then again that would have been a prison sentence for Bellman too, I guess.
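In symbols (my notation), the trick is a Taylor expansion of the dynamics f and the reward r around the current nominal trajectory (x-bar_t, u-bar_t):

```latex
\begin{aligned}
\delta x_t &= x_t - \bar{x}_t, \qquad \delta u_t = u_t - \bar{u}_t,\\
x_{t+1} &\approx f(\bar{x}_t,\bar{u}_t) + A_t\,\delta x_t + B_t\,\delta u_t,
\qquad A_t = \left.\frac{\partial f}{\partial x}\right|_{(\bar{x}_t,\bar{u}_t)},\quad
        B_t = \left.\frac{\partial f}{\partial u}\right|_{(\bar{x}_t,\bar{u}_t)},\\
r(x_t,u_t) &\approx r(\bar{x}_t,\bar{u}_t) + r_x^{\top}\delta x_t + r_u^{\top}\delta u_t
 + \tfrac{1}{2}\,\delta x_t^{\top} R_{xx}\,\delta x_t
 + \tfrac{1}{2}\,\delta u_t^{\top} R_{uu}\,\delta u_t
 + \delta u_t^{\top} R_{ux}\,\delta x_t .
\end{aligned}
```

This yields a time-varying LQR problem in the deviations: solve it backwards for the gains, roll the resulting controller forward to obtain a new nominal trajectory, re-linearize, and iterate.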
Now, this idea of linearizing around trajectories had been around in control for a very long time, and I do not know exactly who was the first to present it; some people say it was Bryson, who wrote the Applied Optimal Control book. The interesting thing is that they actually had to develop it in order to make it to the moon, since the moon project was the first time these linearization-based control laws were really put in place. I should also add that back then they already recognized that a very simplified case of it is an algorithm we all know, an algorithm with a very famous name. Any of you have an idea which one? My god, such silence. Okay: backpropagation. Backpropagation is just the special case where you do a forward propagation, only in this case your trajectory is the signals you send through a neural network, and backpropagation is the backward pass through the network. So people in control knew this in the 1960s, and people wrote fantastic books about it at the end of the 60s, like the Dyer and McReynolds book or the Jacobson and Mayne book. We did not bring it to robot learning until Chris Atkeson's and Stefan Schaal's work in the 90s, where you can now... I can't believe this, I checked all the codecs yesterday, and it picked the wrong one again, the rendering is wrong too; where is this thing, ah, here it is. So what is the most important lesson from this? Don't change your computer before a talk. Here you first see how Chris demonstrated the task, how the task should look, and you should notice Chris looks rather young, because this was the late 1990s. Now you basically see how it pumps, learns how to insert energy, and becomes better on a trial-by-trial basis. You should keep in mind this is the 1990s, so between these trials he basically needed a night to train, in this case, a very shallow neural network as a forward model and as a model of the reward, and then he would use this to do the trajectory optimization on the neural network for the whole night, and the next day he could run the next trial. The computation was really the bottleneck. And, as you see, you can learn different solutions: the first kind are the ones with a single pump-up, and if you punish action slightly more it actually requires the little double pump-up, where you swing twice before you make it to the top, and this is an example of that. So this is a highly nonlinear problem, kind of the generalization of cart-pole. You should note, though, that it is in one way simpler than the pole balancing from before, because it is not pole balancing in 3D anymore, it is just a swing-up in 1D, because at the end of the day there is just one joint and not a freely floating pole. And, returning to this slide, there is way more these days: Emo Todorov went and reinvented this algorithm, which he now calls iLQG, and it is really a kind of simplification of what Chris and these guys used, which was itself a simplification of the 1960s methods. You can do a lot of different and really cool things with it, ranging from swing-ups again, just as Chris did, only in this case with two links and one of them not actuated, and you can even have it track trajectories.
The key difference today is that we have the computation power to do such things in real time, to run the control laws in real time, whereas back then you always had one night in between. In control this has led to a completely different field arising, the field of model predictive control, where we always compute the optimal controller for a limited look-ahead, recompute for a linearized system, and keep doing this at every time step, and for that reason we get something which is very, very close to the optimal solution if the look-ahead is large enough. These days computation is so cheap that we do not really need much more than that. Here you see many of the solutions Emo's group came up with: you can even perturb the humanoid, push it a bit, and when you look at these movements they actually look quite natural, which is very different from what people get in these OpenAI Gym benchmarks, for example. Okay, cool, I think I have made the point: optimal control on learned models can be very powerful. I should also again make the point that it is really, really dangerous, because we can guarantee the existence of an optimization bias: the expectation of the max over actions of some estimate of our Q function is guaranteed to be at least as large as the max over actions of the expectation of this estimate. That basically means: if I have an error in my model or value estimate, my optimizer will exploit it; this is exactly what happened when I destroyed my robot. So quite clearly it is not everything, and there are only two and a half cases we can actually solve. One half is discrete systems, and the world is not discrete. One half is linear systems with quadratic rewards and white noise, ideally Gaussian, and the world is not linear. And finally the half case of linearizing along an optimal trajectory, where finding the optimal trajectory is obviously the hard part. That brings us to the second type of methods. By the way, how am I doing on time, until when do I have? Okay, I guess some of the reinforcement learning will run over into tomorrow; I am very slow today. Okay, let's now start with value function methods: let's cut out the model. Really the best statement about models was by Chris Atkeson again, this time in 2016 at the Humanoids conference, where he said that Clinton was model-based and used strong predictive models of who would vote and how, while Trump did not use any models. For Clinton they knew down to the counties what biases there were, and she really optimized whom she visited to get a perfect strategy; he just did it out of gut feeling, dopamine-based, and as we all know from Peter Dayan (I'm sure you talked about dopamine), that is a very strong reward signal. In other words: learning a good model can be incredibly hard. All models are wrong, but some are useful, as a famous statistician once told us, and in optimal control we are prone to use our model errors against the system and exploit them. And then we recognize, of course, that even for moderately nonlinear tasks it is not that easy to do optimal control, and for really nonlinear tasks it becomes nearly impossible, and model-free approaches have the advantage that they do not need these assumptions on the structure of the model.
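The optimization bias he refers to can be written as one inequality (notation mine):

```latex
\mathbb{E}\Big[\,\max_{u}\,\hat{Q}(s,u)\Big] \;\ge\; \max_{u}\;\mathbb{E}\Big[\,\hat{Q}(s,u)\Big],
```

so an optimizer that maximizes a noisy or slightly wrong estimate of Q will, in expectation, pick actions whose value it has overestimated.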
This brings us to what some people call classical reinforcement learning: learn the value function, not the model. It is somewhat interesting how opinions swing. When you talked to Rich Sutton before, say, 2005, he would hate anybody who used a model; then came a few changes in his life and a few research results, and by now he is, for example, totally pro-model. In general the jury is still out on this question, since it really depends on how good a model you can learn, and also on what we figure out about how to cope with model bias in the future, and there is very little work on that. So, on to classical reinforcement learning: let's learn a value function, not a model. For that we have to recognize that we need to change our objective: we now always go for the infinite horizon, and because of that we need a discount factor which is not merely smaller than or equal to one, but strictly smaller than one. Okay, fine, you can also make up a formulation where it is one, but then everything becomes more complicated. The discount factor in this case allows for stationary solutions. If I put on my controls hat instead of my reinforcement learning hat, I hate those, but from a reinforcement learning perspective they are obviously kind of beautiful, and you have the discount factor trading off long-term against immediate reward. As I know you are not going to answer me anyway, I will answer my own question: what is the discount factor we see in economics all the time, the one the European Central Bank controls, at the moment making Italy's life hell and Germany's life better, and in the past the opposite? The GDP is a state variable in this case, and the bank cannot control the GDP directly, but it can control the money volume via the interest rate, exactly. You can transform the interest rate into this discount factor gamma by one divided by one plus the interest rate, or something like that. By that you determine how much people can borrow, and by controlling how much people can borrow you can influence the economy; and since countries are built differently, it is a really stupid idea to have the same interest rate across different economies, unless you really want to unify them, which in Europe's case would require a political union, but that is an aside. The key thing is: we now take what people in reinforcement learning still call the Bellman equation, which Bellman clearly would not claim (the Poisson equation, he would call it), and we can express the value of a state as an expectation over the policy, where all actions are drawn from this policy, and another expectation over the model; in integrals we could write it like this. Similarly we can express the Q function, with a slight difference. We have really just done policy evaluation again, but this time not with a deterministic policy, with a stochastic one. And in classical reinforcement learning, what do we do? We create a data set of samples, we do not assume a model, we do not want to learn one, and we would like to learn the value or Q function right away. Let's start really simple, with a tabular representation, and this gives you quite a beautiful thing.
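Written out in my notation, the two expectation equations are:

```latex
\begin{aligned}
V^{\pi}(s) &= \mathbb{E}_{a\sim\pi(\cdot\mid s)}\Big[\, r(s,a) + \gamma\,\mathbb{E}_{s'\sim p(\cdot\mid s,a)}\big[ V^{\pi}(s') \big] \Big]
  = \int \pi(a\mid s)\Big( r(s,a) + \gamma \int p(s'\mid s,a)\, V^{\pi}(s')\, ds' \Big)\, da,\\[2pt]
Q^{\pi}(s,a) &= r(s,a) + \gamma\,\mathbb{E}_{s'\sim p(\cdot\mid s,a)}\Big[\, \mathbb{E}_{a'\sim\pi(\cdot\mid s')}\big[ Q^{\pi}(s',a') \big] \Big].
\end{aligned}
```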
When you have one transition from one state to the next, taking an action and receiving a reward and a next state, you have an estimate of the value of the current state, and a second estimate: the reward you received plus the value of the next state. When you subtract the two, you get a one-step prediction error, which we call the temporal difference error. Yesterday in the tutorial I think it was called a time difference error, which I had never heard before, but I am sure some people say time instead of temporal difference. Once you plug this in and update the value of your state as the old value plus a learning rate times this error signal, and then unfold it, it becomes fairly logical: it is one minus alpha times the original value plus alpha times this target, and anybody who has seen a geometric average before knows this is just a geometric average; it slightly underestimates the arithmetic average, but in the end it is an averaging equation. Once you plug in this insight, you can actually prove that this algorithm converges for tabular representations, always, with probability one, which is kind of nice. Intuitively, what it does is take this TD error and compare the one-step look-ahead with the value function we have estimated: if the one-step look-ahead is bigger than the value function, we should increase the value function; if it is smaller, we should decrease it. This gives us model-free policy evaluation: we observe a transition, we compute a TD error, we update the value function, and we keep doing this until we converge, for our V function or our Q function accordingly. So all we have created is a sample-based version of the policy evaluation step. What do we need now? We have a policy evaluation step, so we need a policy improvement step; I am wondering how to keep you guys awake, this is going to be interesting. Before, we were doing the improvement analytically; now we have replaced the evaluation by samples, which obviously means we have to do something smarter for the improvement too. If we just apply a max operator for policy improvement, we immediately jump to the deterministic policy and never change again. That would be as if you gave me a deterministic policy, blindfolded me, and I stood there for the rest of my life hammering against this wall trying to get out. So quite clearly we need exploration. Yesterday you already worked with exploration; two things which are common for value function methods are epsilon-greedy, where you give a bigger probability to the best action and a uniform probability to the rest, and the softmax policy, which other people like, and which you can actually derive from first principles if you want: the softmax does not fall from the sky, you can show it is a soft version of optimal control, so in a sense it is the more interesting one. Either way, you do not always take the greedy action. And now it starts to become more complicated: obviously we do not want to do all this only for the V function if we want to get a policy out; as you have noticed, we actually need the Q function, and there are two ideas for how to get it.
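Here is a minimal tabular TD(0) sketch of that policy evaluation loop; the environment interface and the policy are placeholders, not code from the lecture:

```python
from collections import defaultdict

def td0_policy_evaluation(env, policy, episodes=500, alpha=0.1, gamma=0.95):
    """Tabular TD(0): observe a transition, compute the TD error, nudge V towards
    the one-step target.  `env` is assumed to provide reset() -> s and
    step(a) -> (s_next, r, done); both it and `policy` are placeholders."""
    V = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            td_error = r + gamma * V[s_next] * (not done) - V[s]  # one-step prediction error
            V[s] += alpha * td_error                              # geometric-averaging update
            s = s_next
    return V
```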
One is that we again follow the same insight: we take the previous table entry plus the learning rate times the TD error, now the TD error for Q functions, and we notice there is one big question: what should the action with the red question mark be, the action we bootstrap with? There was a big discussion about this in the 1990s, when people came up with two solutions. One was by Chris Watkins, who, by the way, wrote what is still the most cited reinforcement learning PhD thesis ever, something like eight or nine thousand citations, which I think also makes up 80 percent of his citations. What Chris basically did is say: you should choose the argmax of the Q function, so he follows the insight of dynamic programming. Today I do something crazy, like smoke a cigarette, but tomorrow I act optimally and never smoke again. The cool part is that if you visit every state-action pair infinitely often, this algorithm, on a tabular representation, is guaranteed to converge to the global optimum. This, again, you can only prove by using these averaging equations with the geometric average, and Tommi Jaakkola is the kind of author whose old papers you need to look up if you want to learn how to do this nicely. Because the bootstrapped action is not the action actually taken in the next state, you can do this off-policy: you could sit around, watch another agent, use that agent's transitions, and learn from him. Then there is SARSA, where you take the action you yourself actually take next, so you need to know it, and you then estimate the Q function of the exploratory policy; you always require on-policy samples. Do you guys have a question to ask, please? I am missing the interaction here. And note that the policy we are generating is obviously non-stationary, since it depends on the current values of Q already. Now let's approximate the Q function, since after all we are machine learners and we use function approximation. The easiest case is linear function approximation, where we have some parameters omega and some basis functions, and we want to find the parameters. How could we find them? Again, temporal difference learning. What did you learn this morning: going down the gradient is never a bad idea, so let's try exactly that. We start from the mean squared error and approximate it by the mean squared bootstrap error, where we plug in the predicted value of a state minus the one-step target, and we recognize that in order to build this target we have a recurrence, because the target itself contains a value estimate; let's call the parameters inside the target omega-old. So we are doing what Baron Munchausen does in the fairy tales: he pulls himself out of the swamp by his own hair. That is exactly what we are trying to do, this is the hair and we do the pulling, and that is why it is called bootstrapping. How do we minimize this function? We know stochastic gradient descent, so let's do exactly that: we have the function at omega, we plug in the temporal difference error, and, most importantly, we keep the blue omega-old fixed, while only the red omega is allowed to be optimized.
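The only difference between the two choices is which action goes into the bootstrap target. Here is a minimal tabular sketch, with made-up states and a random behavior policy as placeholders:

```python
import random
from collections import defaultdict

def update(Q, s, a, r, s_next, actions, behavior_policy,
           alpha=0.1, gamma=0.95, off_policy=True):
    """One tabular update; Q is a dict of dicts, behavior_policy picks the next action."""
    if off_policy:   # Q-learning (Watkins): bootstrap with the greedy action
        target = r + gamma * max(Q[s_next][b] for b in actions)
    else:            # SARSA: bootstrap with the action actually taken next (on-policy)
        a_next = behavior_policy(s_next)
        target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

# Tiny usage example with made-up states/actions and a random behavior policy:
actions = [0, 1]
Q = defaultdict(lambda: defaultdict(float))
update(Q, s="A", a=0, r=1.0, s_next="B", actions=actions,
       behavior_policy=lambda s: random.choice(actions))
print(dict(Q["A"]))
```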
If you do this, you get a gradient descent approach, and in this case the update is the TD error times the basis functions, which then lets you do gradient descent. You should very importantly notice that this is not proper gradient descent: we have actually added a hack here, and this hack is quite fundamental. One can show that this estimator is not unbiased but follows a different metric than if we had the true target and were optimizing towards the true solution. There is a paper by Ralf Schoknecht, again from the early 2000s, who showed that this will converge, but to the wrong solution. If you want to do true gradient descent, you can only do it if you have a model and are capable of drawing several independent next states from it, or even doing a complete average, in which case you can do true gradient descent, which in reinforcement learning we call residual gradient descent, and which would converge to the true solution; but at the same time it is not statistically sound if you do not have a model or have not visited every state-action pair infinitely often. So here is a big, big downside: temporal difference learning with function approximation, despite being behind most of DeepMind's successes, is actually not scientifically sound if you do it in this online way, because of this gradient descent problem, and we still have no real idea why it works; we only have methods that deal better and better with the stability issue. So this gives us temporal difference learning with a temporal difference error: we basically correlate the TD error with our feature vector. If you use tabular features, it is all easy and you can prove convergence to the true solution; if you do function approximation, we know this can all fail. You can do the same thing, obviously, for Q-learning and for SARSA, and up to a couple of years back this was the key approach. If you want an overview of the best current methods for the policy evaluation step in temporal difference learning, look into Christoph Dann's JMLR paper; he has really done an excellent job comparing these methods and decomposing them in terms of the underlying optimization steps. So, the important point: this is not proper gradient descent. Sadly, the target values change, since in reality they depend on omega. If you do very, very tiny steps, in practice it usually does not matter, and if you are on-policy, in practice it does not matter either; but if you are off-policy, it has been shown, even for linear function approximation, that if you cannot represent the true Q function you can actually get divergence. Even worse, for nonlinear function approximation we know that every one of these methods can break, and this was shown nearly 20 years ago already, and there are no real fixes, despite all the successes we have seen. So you can really get divergence if you are off-policy with function approximation, or if you are nonlinear even on-policy. And TD learning is fast in terms of the number of features, much better than learning a model, but each sample is only used once, which is obviously super inefficient. It has had tremendous successes in games: you have all heard about Atari and Go, and some of you may have heard that in the 90s we already got to world champion level at backgammon with TD-Gammon, by Tesauro from IBM, and the Atari games are of course also quite famous now. Before I take you down the robotics aisle one more time, there is one more thing.
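Here is a minimal sketch of that semi-gradient update with linear features; the feature map and environment interface are placeholders:

```python
import numpy as np

def semi_gradient_td0(env, policy, features, n_features,
                      episodes=200, alpha=0.05, gamma=0.95):
    """Linear TD(0): V(s) ~= w . phi(s).  The bootstrap target r + gamma*V(s')
    is treated as a constant (the 'frozen omega-old' hack), so this is a
    semi-gradient, not true gradient descent on the squared error."""
    w = np.zeros(n_features)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            phi = features(s)
            target = r + gamma * (0.0 if done else w @ features(s_next))
            td_error = target - w @ phi
            w += alpha * td_error * phi        # TD error times the feature vector
            s = s_next
    return w
```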
You should very importantly notice that this is not proper gradient descent: we have actually added a hack here, and this hack is actually quite detrimental. We can show that this estimator is biased; it follows a different metric than if we had the true solution and were optimizing towards it. There is a paper by Ralf Schoknecht, again from the early 2000s, showing that this converges, but to the wrong solution. True gradient descent you can only do if you have a model and are capable of either drawing many different actions from this model or even doing a complete averaging over them; in that case you can do true gradient descent, which in reinforcement learning we call residual gradient descent, and which then converges to the true solution. But without a model, or without visiting every state-action pair infinitely often, it is not statistically sound either. So here's a big, big downside: temporal difference learning with function approximation, despite being behind most of DeepMind's successes, is actually not scientifically sound if you do it in this online fashion, because of this gradient descent problem. And we still have no real idea why it works; we only have methods which deal better and better with it from a stability point of view.

So this gives us temporal difference learning: we basically correlate the TD error with our feature vector. If you use tabular features it's all easy, and you can prove convergence to the true solution; if you do function approximation, we know this can all fail. You can do the same thing obviously for Q-learning, and you can do it for SARSA, and up until a couple of years back this was the key thing. If you want an overview of the best current methods for the policy evaluation step in temporal difference learning, look into Christoph Dann's JMLR paper; he has really done an excellent job comparing these methods and decomposing them in terms of their optimization steps.

So, importantly, why is this not proper gradient descent? Sadly, the target values change, since in reality they depend on omega. If you take very, very tiny steps it usually doesn't matter in practice if you're on-policy. If you're off-policy, it has been shown that even for linear function approximation, if you cannot represent the true Q function, you can actually get divergence. Even worse, for non-linear function approximation we know that every one of these methods can break; again, this was shown nearly twenty years ago, and there are no real fixes despite all the successes we've seen. So you can really get divergence if you're off-policy with function approximation, or with non-linear function approximation even when you stay on-policy. And, well, TD learning is fast in terms of the number of features, much better than learning a model, but each sample is only used once, which is obviously super inefficient. It has had tremendous successes in games: you have all heard about Atari and Go, some of you may have heard that in the 90s we already got to world-champion level at backgammon with TD-Gammon by Tesauro from IBM, and the Atari games are by now really quite famous as well.

Now, before I take you down to the robotics aisle one more time, one more thing, one more family of methods, which is actually somewhat surprising that it only resurfaced recently. Back in the 90s, Justin Boyan already tried the idea of doing big-batch reinforcement learning, and he basically only reported that sometimes it converges and sometimes it doesn't, which in the end kind of killed it. Nobody tried it for a very long time, until two Europeans came around: one was Damien Ernst in Belgium, and one was Martin Riedmiller, who back then was still a professor in Osnabrück, and both of them introduced this fitted Q iteration, which actually tries to go backwards in time. The other family, which came around in the 2000s, was Lagoudakis and Parr, who basically said: well, we could also do what least squares does, and instead of doing gradient descent compute the optimal solution in one step, which in reinforcement learning actually has slightly better convergence properties than the gradient descent, though not by much. I'll spare you that one, since it's basically just writing down the equations and solving them, like what you did this morning, only in the reverse direction: there you first got the optimal least-squares solution and then the gradients, and you can do the same, in reverse, for the temporal difference error.

But I'll take you down this fitted Q iteration road, which Damien Ernst and Martin Riedmiller really pushed forward, since they basically recognized that you want to use non-linear or more complex function approximators, and many of them you can only make work in the batch setup, like regression trees; some of them, like neural networks, would really give catastrophic results if you do a lot of forgetting, and you have these divergence problems with neural networks when you approximate the Q function. So this all seemed pretty bad, but if you do fitted Q iteration it suddenly becomes much easier.

So what's the idea of fitted Q iteration? The idea is that we replace expectations by regressions. Voila, a very simple idea. We have this data set again of state, action, reward, next state. We initialize the Q function at the final step, the day we die so to speak, to zero. We write down the input data for our estimator, then we create target values based on the rewards and the maximum of the Q function, which in the first step is zero but later not, and then we do a regression instead of computing an expectation, giving us a new Q function. And then we go back in time: just like value iteration, this approximates the Q function at each iteration going backwards. This batch setup makes a huge difference in terms of efficiency, if you can make it work.
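To make the "replace expectations by regressions" loop concrete, here is a rough sketch of fitted Q iteration over a batch of transitions. It is my own illustration under assumed conventions (states as vectors, a small discrete action set, and a tree ensemble roughly in the spirit of Ernst's tree-based variant), not the original authors' code.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Sketch of fitted Q iteration on a batch of (s, a, r, s_next, done)
# transitions with a discrete action set. Each iteration builds regression
# targets from the current Q estimate and fits a fresh regressor.

def fitted_q_iteration(S, A, R, S_next, done, actions,
                       n_iterations=50, gamma=0.95):
    X = np.hstack([S, A.reshape(-1, 1)])    # inputs: state-action pairs
    q_model = None                          # Q is initialised to zero

    for _ in range(n_iterations):
        if q_model is None:
            max_q_next = np.zeros(len(R))   # first pass: max Q is zero
        else:
            # max over the actions of the current Q estimate at s_next
            q_next = np.column_stack([
                q_model.predict(
                    np.hstack([S_next, np.full((len(S_next), 1), a)]))
                for a in actions])
            max_q_next = q_next.max(axis=1)
        y = R + gamma * (1.0 - done) * max_q_next   # regression targets
        q_model = ExtraTreesRegressor(n_estimators=50).fit(X, y)
    return q_model
```

Each pass through the loop propagates reward information one step further back, which is the going backwards in time the text refers to.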
The big problem is making it work. So why did Justin Boyan, who tried exactly this between 1998 and 2002, fail at it and end up giving this kind of silly talk and report where he just created a big matrix of things that worked and didn't work, and why did Martin Riedmiller succeed? Both are using neural networks, and very similar neural networks; I think the only big difference is that Justin used a regular, actually Levenberg-Marquardt-style, optimizer, while Martin uses his favorite Rprop optimization. It really lies in a trick. What Justin Boyan had in there was that the minimum of his Q function simply sat at some constant value of a certain size, with the result that the network could always increase and increase and increase. Martin, on the other hand, basically had the idea of saying: why don't I fix my Q network at the goal state? So he zeroed it out at the one state he wanted to get to (you could also do this for multiple goal states), but at the goal states it was supposed to be zero, and everywhere else it can take arbitrary values. With the result that this loop of regressions and maximizations has at least one fixed point, and everything becomes kind of a spanning tree between the samples from that fixed point on, and thereby fairly efficient.
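A hedged sketch of how that clamping could be realized on top of the fitted-Q sketch above; the exact recipe in Riedmiller's work may differ, and the names here follow my sketch, not his code. Before each regression, append artificial patterns that pin Q to zero at the goal state(s) for every action.

```python
import numpy as np

# Append artificial patterns that clamp the Q function to zero at the goal
# state(s), for every action, so the regression/maximization loop has at
# least one fixed point to hang on to. X, y are the regression data from
# the fitted-Q sketch above; goal_states is a list of state vectors.

def add_goal_patterns(X, y, goal_states, actions):
    goal_X = np.array([np.concatenate([g, [a]])
                       for g in goal_states for a in actions])
    goal_y = np.zeros(len(goal_X))          # Q(goal, a) clamped to zero
    return np.vstack([X, goal_X]), np.concatenate([y, goal_y])
```

Inside the loop you would call this before the fit, for example `X_aug, y_aug = add_goal_patterns(X, y, goal_states, actions)`, and fit the regressor on the augmented data.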
Now, Martin had an amazing career out of that, especially in robotics in the 2000s, since he was a professor back then at Osnabrück. Who of you has heard of Osnabrück? Nobody, I guess. Oh, one person. They have a nice cognitive science department, but they have a big problem: they don't have a computer science department. So Martin was basically a computer science professor in a psychology faculty, and he was trying to win RoboCup. Those of you who have been to RoboCup know it's what the Germans call a Materialschlacht, which means you throw as much material in terms of students, hardware, computation and tuning at the problem of having robots play soccer against each other. The teams which won it between 2000 and 2010, and the teams which win it today, have giant crews who all try very hard and put a lot of human intelligence into the system; the best computer science students in the world, the best hackers, win. Now, how did it look for Martin? He had a couple of psychologists and, I think, one PhD student, as Osnabrück is also a rather poor school; it doesn't have a lot of money. With the result that he had to do something smarter, and he brought in this neural fitted Q iteration, which allowed him, with just a single PhD student, to learn really, really fast: behaviors like the dribbling behavior here, where you now see the convergence of successes and, here, the convergence of failures, much, much faster than you could tune it by hand. They would do much better than the best hand-coded policy, which you see here in terms of successes and failures, and they could even learn from the results of other robots, not just of his own team. This allowed Martin, I think, to become fifteen-time world champion in robot soccer with this team of very few PhD students, never more than two, and to really drive this method forward. It also made him one of the earliest reinforcement learning employees of DeepMind, where he initially spent a sabbatical and then became a full-time employee, giving up his professorship; by that time he had even moved from Osnabrück to Freiburg, which is a much better and more interesting university, but, well, the call of DeepMind was stronger.

So with that we are at the end of value function methods. They have been a real driving force in reinforcement learning since the 90s: they've given us professional-level backgammon and checkers players, they've won the RoboCup, they've brought us the Atari games, they brought us Go, most of which, if not all, with a minimum of manpower. So why wouldn't we in robotics call them, in general, the method of choice? Well, you need to fill up your state-action space with samples. In RoboCup you can do this quite nicely, because you're dealing with mobile trash cans living on a floor; the floor is generically 2D, and if you bring in the orientation of the robot, maybe two and a half D. But it is never the 65 dimensions of a humanoid robot; it's not even the 14 state-space dimensions you need to worry about when you're dealing with a robot arm. That basically makes certain that there you can always fill up the state space with sufficient samples, which is not the case for general robots. So we have another curse, another exponential explosion: not of the number of discrete entries in a table, but of the number of samples which are required.

And these errors of value function estimation can be shown to have detrimental effects on the policy. You can actually show that the biggest error in your value function, anywhere in your state space, can affect every other state in terms of the policy. So the best guarantee we get in value function methods is always in terms of the worst approximation error we have, which should make somebody who does supervised learning go bonkers, because that's a nightmare scenario, right? So it can be really hard to do control here, because things become unstable. Also, a small change in the value function can cause a big change in the policy. Just imagine the action of going to the left and the action of going to the right were nearly equally good, and I had mainly chosen to do one of them, and now the other one became slightly better: suddenly I would have to shift completely, and these changes may quickly destroy the stability of our approximate policy iteration. So it basically only scales better than optimal control with learned models because our samples are always at the right location, and to some extent you can ask the big question whether an approximate tabular representation of your state-action space, with dynamic programming on top of it, wouldn't do the job just as well as value function methods; on this the jury is still out.

So now I'm at the last part, and I'm supposed to stop in three minutes. Am I right, or am I supposed to have stopped already? So I'll take questions for three minutes, and we will do the policy search part tomorrow, unless you want to give me part of your lunch break. Okay, who wants to give me part of their lunch break? Okay, I see some already trying to run, so then I'll basically just take questions.