Seems like I'm losing ten students every day. Okay, wow — I'm left with the last two-thirds of all students. What did I do to scare you all off? Is it the robots? They're not going to jump out of the screen and kill you.

Well, okay. I hope that by now I have persuaded you that learning a model and then computing an optimal control law based on the learned model is a really good idea. Chris Atkeson and Stefan Schaal did it, and they got a lot of robot-learning mileage out of it — it worked quite successfully for them. But I also had to explain to you that a small error in the model can be detrimental: it destroyed my robot and made me go back to my director and ask for a very expensive, ten-thousand-euro repair in the first week of my robot's usage. So we saw that this optimization bias, which the optimizer will exploit when working with an approximate model, can be detrimental.

For that reason we moved to value function methods, following the ideas of Rich Sutton — and basically looking at what made Martin Riedmiller so successful in RoboCup. If you can fill up your relevant state space with samples, you can compute a value function without having a model, and from that you can compute quite a nice optimal control policy without explicitly modeling your system. That's actually also quite amazing. But again there is a caveat: you need to have the samples in the right space. And if I look at a humanoid robot, well, we all know that is not a space you can simply fill up with samples. So you sometimes need to become smarter — or, to use the right words, get even closer to the system — and that has led many of us in robot learning to pursue the idea of policy search.

Interestingly, that idea — thanks to the efforts of OpenAI, Sergey Levine, and Pieter Abbeel — has now become a little too sexy for its own good, since many of these algorithms don't work as well as they make you believe. I know, because I developed some of them, even if they don't carry my name anymore. So take the OpenAI results you read these days not with a grain of salt, but maybe with a bag of salt.

Let's start with why we actually like to do policy search and what brought us to it. It wasn't the filling-up-the-samples part; it was much more this: if I have a navigation problem and I have to go either to the left of a table or to the right of it, then the greedy operations in value function methods — like in Q-learning — make you commit very aggressively to either the left or the right, even though the two options differ only very, very little. And that is a big problem, because a small change in your value function can cause a detrimentally large change in your policy, which in very high-dimensional state-action spaces will screw you over and will not be stable.
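To make that concrete, here is a minimal sketch (all numbers are made up for illustration) showing how a tiny perturbation of two nearly equal action values flips a greedy policy completely, while a smoothly parameterized stochastic policy barely moves:

```python
import numpy as np

# Two actions with almost identical values (hypothetical numbers).
q = np.array([1.000, 1.001])                 # "go left", "go right"
q_perturbed = q + np.array([0.002, 0.0])     # tiny change in the value estimate

def greedy(q):
    """Deterministic greedy policy: probability 1 on the argmax action."""
    p = np.zeros_like(q)
    p[np.argmax(q)] = 1.0
    return p

def softmax_policy(q, temperature=1.0):
    """Smoothly parameterized stochastic policy (Boltzmann/softmax)."""
    z = (q - q.max()) / temperature
    e = np.exp(z)
    return e / e.sum()

print(greedy(q), greedy(q_perturbed))                    # [0,1] -> [1,0]: complete flip
print(softmax_policy(q), softmax_policy(q_perturbed))    # changes only marginally
```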
That's something we saw a lot with the greedy operator. You can do value functions without a greedy operator, but this instability is really what got policy search started in robotics: policy gradients, which take the derivative with respect to the policy parameters, do something much more stable. A small change in the sufficient statistics — not even necessarily a value function — causes only a small change in the policy, which gives you more stable learning. That was one of the big reasons people got into policy gradients and policy search in robotics. The second reason, which will become more important in the second part of today's lecture, is that we can build upon imitation much more easily, since we can directly initialize the parameterized policy from an imitation, which usually gives us a good starting point — just like in tennis, where the tennis teacher gives you the forehand and the backhand by kinesthetic teaching. That gives you a very good initial policy representation which you can subsequently improve.

In a way there have always been two ways of doing policy search. The first are the black-box approaches, where we completely ignore any knowledge about the system: we plug in some parameter perturbation and get some output perturbation back. If you view this as a Taylor approximation and solve that approximation for the derivative, it becomes a quite straightforward least-squares problem from which you get your gradient. Rebranded under the name of random search, Ben Recht, for example, deeply impressed large chunks of the reinforcement learning community by showing that with finite differences you can beat many of the state-of-the-art algorithms on the OpenAI benchmarks. In a way, yes — we have known this for thirty or forty years, and in robotics it goes back to the fifties. People have been doing parameter perturbation for as long as I can think; I know methods from thirty years ago which did robot learning by parameter perturbation. What Ben calls random search really just means finite-difference gradients, and there is a ton of such methods, since the simulation optimization community has been using them since Kiefer and Wolfowitz in, I think, 1952. You can really find them all over that literature.
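As a minimal sketch of this black-box idea (the quadratic objective, perturbation size, and step size are purely illustrative assumptions of mine), a finite-difference gradient estimate is literally a small least-squares problem over random parameter perturbations:

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_return(theta):
    """Stand-in for a noisy rollout evaluation of policy parameters theta."""
    return -np.sum((theta - 1.0) ** 2) + 0.01 * rng.normal()

def finite_difference_gradient(theta, n_perturbations=20, eps=0.05):
    """Estimate dJ/dtheta from random perturbations via least squares."""
    d_theta = eps * rng.normal(size=(n_perturbations, theta.size))
    d_J = np.array([expected_return(theta + dt) - expected_return(theta)
                    for dt in d_theta])
    # Solve d_theta @ g ~= d_J for the gradient g.
    g, *_ = np.linalg.lstsq(d_theta, d_J, rcond=None)
    return g

theta = np.zeros(3)
for _ in range(200):                  # plain "random search"-style ascent
    theta += 0.1 * finite_difference_gradient(theta)
print(theta)                          # should approach the optimum at [1, 1, 1]
```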
Then there is the second kind of method: the likelihood-ratio policy gradient methods. They are more of a white-box approach, and they have been on and off in reinforcement learning for a very long time. The first time they were around was when neural networks were totally hip the last time — the second neural hype — when people like Gullapalli introduced them under the name SRV and Williams introduced them under the name REINFORCE and popularized them in reinforcement learning. In other fields they have been known much longer: the simulation optimization people knew about them for at least another decade before that, for example through the work of Peter Glynn.

What they assume is that you can write down the probability of a trajectory — in the end it all boils down to that. Once you have the probability of a trajectory and the rewards along that trajectory (you can discount them if you like, you don't necessarily need to), you can write down the expected return as an expectation over this trajectory distribution. This trajectory view actually makes it much easier to derive things than the state-action view.

Then you use one trick, which is called the log-likelihood trick, and it is as primitive as it gets. Go back to your high school studies and remember the derivative of the logarithm of a function: it's just one divided by the function, times the derivative of the function. Every one of us has seen this in high school — the Russians probably already in primary school. So you can rewrite the derivative of a function in terms of the function itself times the derivative of its logarithm: ∇f = f ∇log f. Very primitive, but very useful for us, because we can now do one trick. We take the derivative of the expected return over all possible trajectories, write down the nabla operator, and move it inside the integral — which you can do because the integration limits don't change — and once it's inside, you replace the derivative of the trajectory probability by the probability times its log-derivative. Your first thought would be: okay, fine, this didn't change anything — until you use samples. With samples you only need the log-derivative term; you don't actually need to know the trajectory probability anymore.

That again should initially make you think: so what? But once you compute the log-derivative of the trajectory probability under the Markov assumption, something really cool happens. First, all the products become sums — that's a more numerically stable derivative, which is directly good. And then we recognize that the initial-state distribution does not depend on our policy parameters, and the transition model does not depend on our policy parameters either. So what has happened? Thanks to the log, those terms are just an additive constant, and the sum of the log-policies sits on top. Once we take the derivative, the derivative of the log trajectory distribution becomes just the sum of the derivatives of the log policy: ∇_θ log p_θ(τ) = Σ_t ∇_θ log π_θ(a_t | s_t). That makes our life much easier, because we don't need to carry a model around.

You should note one important thing, though: the moment you move to deterministic policies, you lose the very logic of the likelihood-ratio gradient and you have to bring a model back in. If you are working with stochastic policies, you're totally fine. Now, the moment you plug these two things together — the log-policy derivatives and the reward — voilà, you have something which is called the REINFORCE policy gradient: ∇_θ J(θ) ≈ (1/N) Σ_i [Σ_t ∇_θ log π_θ(a_t^i | s_t^i)] R(τ^i).
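A minimal sketch of that estimator for a one-step (bandit-style) Gaussian policy — the toy reward function and all constants are my own assumptions, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(1)

theta = np.array([0.0])      # mean of a Gaussian policy over a 1-D action
sigma = 0.5                  # fixed exploration noise

def reward(a):
    """Hypothetical reward: the best action is a = 2."""
    return -(a - 2.0) ** 2

alpha = 0.05
for _ in range(500):
    actions = theta + sigma * rng.normal(size=100)           # sample actions
    returns = reward(actions)
    # Likelihood-ratio (REINFORCE) gradient for a Gaussian policy:
    # d/d(theta) log N(a | theta, sigma^2) = (a - theta) / sigma^2
    grad_log_pi = (actions - theta) / sigma ** 2
    grad = np.mean(grad_log_pi * returns)
    theta += alpha * grad
print(theta)   # should move toward 2.0
```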
People tried this REINFORCE policy gradient — and some people are really good at tuning — already in the nineties, at the end of the last neural-network hype. What happened? Well, they managed to train quite impressive things. There was a guy called Benbrahim who even taught a basic biped robot some gaits for walking, and a few other people did fairly impressive things with it — but most people who tried this algorithm completely failed.

This has an interesting explanation: what we have here is a basis-function vector which can point in pretty much any direction, and we always multiply it by a positive number. Think about it: say this is our 2-D parameter space and here is our current policy. We may have, say, four such unit vectors arising from the log-derivative — there could be more, depending on the number of actions — and if every one of them is multiplied by a positive number, each of them contributes a pull in a different direction. When everything is completely clean — infinitely many samples — this is easy, and your resulting gradient will be a tiny vector. (Can you even see this from that far? Okay, I'll just trust you.) You will have a tiny vector going in roughly this direction. But in most cases we have noisy gradients, and then we have a really bad signal-to-noise ratio. That was one of the reasons people stopped using them: it simply didn't work well.

Now there is one trick — oh no, I didn't put it on the slides — with which you can already make it work much, much better, and that is the idea of a baseline. You can subtract a constant value from the return — you take R minus a baseline — without biasing your gradient estimate. The reason is that the expectation of the log-derivative of the path distribution alone is zero, which follows from the fact that p integrates to one, so the integral of the derivative of p is zero. This baseline makes a huge difference, because now, despite having basis vectors pointing in all directions, only some directions actually contribute, and even if we are slightly stochastic our gradient will, on average, go in the right direction. This idea of baselines has been around since the early 1990s — since the last neural summer, so to speak.

Nevertheless, it still worked pretty terribly. People only managed to make it work by having the baseline and by using very peculiar learning rates.
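Here is a tiny sketch of that baseline trick, continuing the toy Gaussian-policy example from above; the per-batch mean-return baseline and the constant reward offset are again assumptions chosen purely to make the variance reduction visible:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, sigma = 0.0, 0.5

def reward(a):
    return -(a - 2.0) ** 2 + 50.0    # large constant offset makes variance hurt

def grad_estimate(use_baseline, n=100):
    a = theta + sigma * rng.normal(size=n)
    r = reward(a)
    b = r.mean() if use_baseline else 0.0        # constant baseline per batch
    return np.mean((a - theta) / sigma ** 2 * (r - b))

for flag in (False, True):
    grads = np.array([grad_estimate(flag) for _ in range(1000)])
    print(f"baseline={flag}:  mean={grads.mean():+.2f}  std={grads.std():.2f}")
# The means agree up to noise (the estimate stays essentially unbiased),
# while the standard deviation drops dramatically with the baseline.
```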
We eventually figured out — and the biggest credit here goes to Sham Kakade, who introduced it originally in a heuristic way; Drew Bagnell and I then independently figured out that it really was the right way of doing things — that you should be doing what Amari had already introduced in supervised learning: a natural gradient. The first time you hear about natural gradients, it feels like a totally fuzzy construct, until you learn what it is really about, and the intuition is actually quite amazing. The intuition is that a normal gradient means you take a small circle, put it onto your parameter space, make it sufficiently small, and then along the line of that circle you search for the best point — that's what steepest ascent with respect to a circle means. But look at a case like this one, where the parameters are probabilities between zero and one: the plain gradient would always push us to the edge; it would never bring us to the optimal solution. So quite clearly, if we deformed this metric so that it points more in the right direction, that would be a very smart thing to do.

Deforming the metric is of course what second-order methods do with the Hessian, and people initially tried exactly that with REINFORCE. REINFORCE with a Hessian simply didn't work: it deformed things in the wrong direction — in fact it pushed you even faster into the boundaries where you have zero exploration and therefore zero improvement. So you need to realize what kind of deformation you actually need, and the big question is: in what kind of space do I want to live? Of course, I live in trajectory space. So if I want to compare policies, I should compare them by the difference they make in trajectory space. If I want to measure what difference my new policy with a parameter change δθ makes, the right — or at least the most natural — way of measuring it is the KL divergence between the new trajectory distribution and the old one. The moment you Taylor-approximate this KL, what do you get? A constant, plus one half δθᵀ F δθ for some matrix F: KL(p_{θ+δθ} ‖ p_θ) ≈ ½ δθᵀ F(θ) δθ. We don't care about the higher-order terms, and the constant we don't care about either — and suddenly we have this magic matrix. In supervised learning, and even in unsupervised learning, this magic term is essentially the Hessian, because there the objective is built from the same probability distribution as the space we are living in. In reinforcement learning, on the other hand, the space our trajectories live in is different from our objective function, which only enters through the expectation — we take the derivative of that expectation with an additional reweighting.
For supervised learning, Amari makes exactly that point: this matrix is the Hessian — or at least a very effective way to compute it — while in reinforcement learning this Fisher information matrix lives in the path space. That was really the key difference between what Drew and I did and what Sham did: we figured out that you have to do all of this in the path space, and in that way you get the correct Fisher information matrix. Once you have it, you can plug it in, for example, as a constraint of staying within a "circle" whose radius is some ε, which gives you exactly this deformed metric. Once you solve that optimization problem, something fairly cool and magical happens: just as if you were stepping with respect to a Hessian, you are now stepping with respect to the Fisher information matrix, and it comes with two really nice properties. First, the Fisher metric punishes you quite rapidly if you try to go to the edges of the distribution space, so you are not very likely to go toward zero exploration if you bound your update with respect to the Fisher information. That is an exploration-exploitation related advantage. The second big advantage is the property of being covariant. This is something we still totally under-appreciate, especially right now, when parametric gradient methods have become so powerful and useful all over: you want it to be the case that if you reparameterize your algorithm — give it a slightly different representation — it still gives you the same answer. This is a request which Donald Knuth already wrote down in one of his first books, in the fifties or sixties; he made that request of algorithms, and in machine learning we are mostly failing on it. Everyone knows we normalize everything so that we get somewhat more predictable results, but that is about the hackiest way you could address it. If you use the Fisher information, you actually get exactly this property, at least for linear reparameterizations.

So is it useful? I can give you a very strong yes here, and I've taken two very simple problems to show it. One is again linear quadratic regulation, in this case with a stochastic policy — you have already seen how you would handle this with model learning for optimal control — and it's just a 1-D LQR problem. On one axis I have the controller exploration, the amount of noise I add to my actions, and on the other the controller exploitation, which is basically the gain I multiply onto my state. Similarly, there is the two-state problem, where one transition probability is on one axis and the other on the other. You want to be in the state where you get a reward of two and take that action as often as you can, but if you are in the zero state you are biased toward staying there — at least if you look only at immediate reward — while transitioning gives you nothing. Now, if you look at the regular policy gradient, you see it is actually doomed. Even when you do the baseline trick — the best possible trick — you're doomed, because for the LQR problem the policy gradient essentially only learns one lesson: exploration is expensive.
So let's get rid of it — and voilà, wherever you start, unless you're directly located at the optimal gain, you end up at a solution with exactly zero exploration. Not particularly nice. For the two-state problem it's even worse: if you are already close enough, you may get to the behavior of always staying in the good state, but quite frequently you end up in pretty bad local optima — such as wanting to stay in state zero all the time, because you get immediate reward there — and you never reach the optimal solution. If you look at the natural gradient instead, then even in the worst possible scenario — the scenario where you nearly always stay in zero — you keep going toward the optimum along the edges. That is really the power of knowing the metric which defines your space.

Well, damn it, I thought I had a video here. One thing we used this for back then was teaching a robot to hit a T-ball. I don't know why that video is missing, but I will just continue from here since we don't have enough time today. Question? — Oh, look here: the gradients in the LQR problem all point to the optimal solution; they are like the ideal gradients. In fact, you can even show for the LQR problem that the natural gradient with the right learning rate always ends up directly at the optimal solution — if you have estimated it infinitely precisely, you could actually do it in a single step. And this right learning rate is surprisingly simple; I think it's something like one divided by the square of the variance times the expected return, something along those lines — I would have to look it up; I derived it fifteen years ago and don't remember it. Yes, to answer the question: the vanilla gradients all point downward; they would all go to zero exploration and therefore stay at a constant gain.

Now, why am I bringing up these pretty old results? Well, the algorithm most of you know under the name TRPO does basically nothing but rename what we used to call the episodic natural actor-critic. It has some tiny variations in it, but there is basically nothing really new in there, and that carries much of the mileage of what's currently going on.

Still, I want to take you away from this idea of doing gradient ascent in policy space, for the simple reason that in robotics it is a terrible thing to do if you have to do it on-policy on a real robot. I'm sorry I don't have the video of the T-ball example, where we put a ball on a stick and the robot had to shoot it away. That required on the order of one thousand two hundred trials to get right, which meant that I, as a PhD student, had to run through the lab one thousand two hundred times — for every single trial, go get the ball and put it back on the stick. Can you imagine how frustrating that is? Pretty frustrating.
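Before we leave gradients behind, here is a minimal sketch of what such a natural-gradient step looks like for a simple Gaussian policy. The Fisher matrix is estimated from the sampled score vectors; the toy reward, the batch sizes, and the KL radius are my own assumptions for illustration, not the episodic natural actor-critic itself:

```python
import numpy as np

rng = np.random.default_rng(3)
mean, log_std = 0.0, np.log(1.0)      # policy parameters theta = (mean, log_std)

def reward(a):
    return -(a - 2.0) ** 2            # hypothetical reward, optimum at a = 2

epsilon = 0.05                        # KL "trust region" radius
for _ in range(150):
    std = np.exp(log_std)
    a = mean + std * rng.normal(size=200)
    r = reward(a)
    r = r - r.mean()                  # baseline
    # Score (log-policy gradient) w.r.t. (mean, log_std) of a Gaussian policy.
    score = np.stack([(a - mean) / std ** 2,
                      ((a - mean) ** 2 / std ** 2) - 1.0], axis=1)
    g = (score * r[:, None]).mean(axis=0)                      # vanilla gradient
    F = (score[:, :, None] * score[:, None, :]).mean(axis=0)   # Fisher estimate
    nat = np.linalg.solve(F + 1e-6 * np.eye(2), g)             # natural direction
    step = np.sqrt(2 * epsilon / max(nat @ g, 1e-12)) * nat    # keep KL ~ epsilon
    mean, log_std = mean + step[0], log_std + step[1]
print(mean, np.exp(log_std))          # the mean converges toward 2.0
```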
So, for that reason, in robot learning we have been doing something different for a very long time now: we try to search for alternatives — for surrogate functions which make us more efficient but build on the insights of supervised learning. This is where I really want to lead you, and it actually has some very nice foundations in how humans handle stochastic actions. When humans learn from their own trials in iterated decision problems, they do something pretty crazy: they do not just jump to the best action or the best policy, but they match the rewarded frequencies of actions and outcomes. That sounds crazy, but in the end it allows you to keep exploring while taking larger steps toward new policies — steps which are not necessarily even along the gradient, and which don't require a learning rate.

So we want to create a new policy which is like the old policy, just reweighted — in the immediate state-action scenario by the rewards, more generally by some function of the rewards. Sadly, this is of course only possible for non-negative reward functions. Intuitively it looks a lot like classification: if you had a zero-one reward, I would want to jump only to the ones and not to the zeros. But let's make it not zero-one; let's make it a one-versus-two reward. For a one-versus-two reward we would of course rather reproduce the twos, but having the ones in addition is still better than having no data at all. That's the intuition.

This turned out to give us something very useful. In the bandit setting this had been realized by Peter Dayan and Geoffrey Hinton a long time earlier, but nobody had made it work in long-term reinforcement learning, or even really considered it. We make the reward an improper probability distribution, move from what in supervised learning would be the log-likelihood to the log expected return, and from that obtain a lower bound on the expected return. Luckily, this bound contains the reward only in one place and the policy parameters of the new policy only in another, so we can turn it into an EM-style algorithm which jumps relatively fast from good policy to good policy. You can actually re-derive policy gradients that way — which is not very helpful on its own, because the bound would just keep giving you very small updates — but it also allows you to take larger steps.

Oops — this here is the Ball-in-a-Cup scenario, where, starting from an imitation, we learned exactly this behavior. I'll let the video play one more time; I already showed it to you on Monday. You give the robot a small reward based on the proximity between ball and cup, and it gets better on a trial-by-trial basis. Unlike before, we did not have to rush through the lab as much, and we needed far fewer trials: from thousands of trials we were suddenly in a regime where we could learn things within hundreds of trials. For robotics that is quite a big difference, because real robot time is costly compared to computation time — and even worse, it's not just that real robot time is costly; you have to multiply the robot time by a very large factor, which is the factor of you taking care of the robot, repairing it, bringing back the ball, untangling the wires of the experiment, and so on. In the end, within ninety trials we always got to perfection. I don't know a single human being who gets to perfection; I think the best humans reach a success rate of about sixty percent — a fact I didn't mention on Monday.
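A minimal sketch of this reward-weighted idea — it is in the spirit of the EM-style updates described here (reward-weighted averaging of sampled policy parameters), but the episodic toy task, the exponential reward transformation, and the fixed exploration noise are simplifying assumptions of mine:

```python
import numpy as np

rng = np.random.default_rng(4)

mu = np.zeros(5)            # mean of a Gaussian policy over parameters
                            #   (think: weights of a movement primitive)
cov = 0.25 * np.eye(5)      # fixed exploration noise for this sketch

def episode_return(w):
    """Stand-in for running one rollout with parameters w and scoring it."""
    return -np.sum((w - np.arange(5)) ** 2)

for _ in range(40):
    samples = rng.multivariate_normal(mu, cov, size=50)
    returns = np.array([episode_return(w) for w in samples])
    # Turn returns into non-negative, improper "probabilities".
    weights = np.exp(returns - returns.max())
    weights /= weights.sum()
    # New policy = old samples reweighted by reward; note: no learning rate.
    mu = weights @ samples

print(np.round(mu, 1))      # moves toward the optimum [0, 1, 2, 3, 4]
```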
Yes — how do we process the perception? In this case we had a ball detector, so directly out of the vision system we would get a 3-D ball position and velocity, and that was immediate. The biggest step is computing the policy update, but even that was nearly immediate. Yes, pretty much — so it's quite a step from what I showed you yesterday, where Chris always had to wait a night of dynamic programming through a big neural network. Here we run the whole thing in a chain: in one afternoon you can run, I think, twenty or thirty complete trial runs without too much bother. It's still the case that you need one afternoon for one plot in a figure — and that's when your experiment works perfectly — but still quite a step up from when people deal only with simulation.

Yes — no, this is not an imitation approach; this is a policy search approach. You start with an initialization by imitation from a human, for the simple reason that searching this space from scratch on a real system is not a feasible option. You could, of course, do what we talked about yesterday: learn a forward model and then incrementally search the space, with all the pros and cons of doing that.

That depends. Right now we only give it a reward based on the minimum proximity between ball and cup, which is a relatively informative reward, I've got to admit, and it doesn't tell you anything about the quality of your actions. If you look at what we did earlier — I didn't show you the video, but the example where we had the ball on top of the T-ball stick — there we actually had to give it a reward with many more assumptions in it. Obviously we had the reward we were interested in, shooting the ball as far as possible, but that is very quickly gamed by a policy: initially you get a lot of reward just by hitting as hard as you can, even against the T-bar rather than the ball — that already flies quite far — and the system would go down that path, probably never recover, and in addition break the robot. So there we had to add a punishment on the actions. And once you punish actions, you can also unlearn more of the human demonstration. In robot table tennis we actually had to do a lot of unlearning of the human trials; in that case we didn't even need to do it through the reward function — it happened pretty much automatically, because humans can produce much higher accelerations than robots can. Robots can move faster, they can move more precisely, they can see more precisely and faster — but robots cannot accelerate like humans; that is the advantage of muscles. In robot table tennis we saw one thing clearly: in the first rounds, when we were learning forehand strokes, we would show the robot a stroke and it would be pretty terrible at it, because humans flick, right? They do a really quick acceleration through the ball, which is obviously a very effective technique for moving the ball.
The outgoing velocity of the ball is the incoming velocity plus what you add during the time you accelerate through the ball, so if you accelerate through it with such a flick, the ball can be at the other end of the table. The robot can't do that, so the robot basically just had the reflection: it had to move for longer to reach a higher speed and then reflect the ball better — and it would learn exactly that when we learned forehands in table tennis. This is simply because we had a cap on the acceleration, which came naturally from a cap on the torque: if you command the robot to produce higher torques, the motors just won't create them anymore; they saturate, as we call it in robotics.

Now, for policy search — and I think for robot reinforcement learning in general — we are always facing three key problems. One is getting a notion of what the data is into the problem formulation; in value function methods we really do this heuristically when we plug in data. The second is that we always have this optimization bias, and it gets particularly bad when you are using a forward model. And third, the role of features is nearly always unclear: how do you get features into the formulation in a non-artificial way? If you do what people do in supervised learning with features, that very rarely has a natural relation to reinforcement learning.

What you should take away from the whole policy search idea is, first, that you should do as much as you can on the observed data distribution, because in robotics that is the only data we know we have for sure. The second part you should take away is that you should always punish the divergence — the distance — between the state-action distribution you are generating under your new policy and the one you have observed up to this point. And finally, we have to define our MDP on top of feature functions instead of hoping that we can find some abstract states that actually make sense. You can hard-code these assumptions by taking the classical optimal control problem and — oops, sorry, this should be completely red — adding just these red components: a bound on the information loss, plus requiring the forward propagation of the state distribution to hold only on the features. I think this is the road to go for developing reinforcement learning algorithms, and in this particular setting — discrete problems and so on — we can actually make it work. We get a reward-weighted Gibbs policy, where you could also replace the delta, which is really an advantage function, by the Q-function if you liked. And, where it gets most interesting, you obtain compatible critic functions: these give you the loss functions from which to derive value function updates, which you can then again use for value functions or other sufficient-statistics updates — all of it resulting from the key assumptions on the original problem.
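As a rough sketch of what such an information-loss bound does in practice — this is only the sample-based reweighting step in the spirit of relative entropy policy search; the bisection over the temperature is my simplified stand-in for the proper dual optimization, and the advantages are made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(5)
advantages = rng.normal(size=500)          # hypothetical advantage estimates
epsilon = 0.3                              # allowed information loss (KL bound)

def weights_and_kl(eta):
    """Gibbs re-weighting of the observed samples at temperature eta."""
    w = np.exp((advantages - advantages.max()) / eta)
    q = w / w.sum()
    kl = np.sum(q * np.log(q * len(q) + 1e-12))   # KL(q || empirical data)
    return q, kl

# Crude bisection on the temperature so that the re-weighted distribution
# stays within the KL bound (stand-in for solving the dual problem).
lo, hi = 1e-3, 1e3
for _ in range(60):
    eta = np.sqrt(lo * hi)
    _, kl = weights_and_kl(eta)
    lo, hi = (eta, hi) if kl > epsilon else (lo, eta)

q, kl = weights_and_kl(eta)
print(eta, kl)   # the new policy is then fit by weighted ML using weights q
```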
So, a quick wrap-up before I move on to the topic of imitation. Policy search is a powerful alternative to the value function methods and to model-based reinforcement learning. Personally, I'm not a fan of policy gradients at all anymore. I think they were a valuable step for me in understanding the whole problem, but at the moment they are really misleading us, and it's quite surprising how much: at every major machine learning conference you have ten or fifteen papers on them, while in reality they are not such a good tool. You can really only make them work in simulation, because on a real system you never get as many interactions as you would need — or you first have to learn a model. And in addition, unless you use all the tricks of baselines and natural gradients, you can't really make them work that well. Learning the exploration rate is actually still a hard, open problem, and I think the probabilistic approaches are quite useful: anything which has worked so far has been shown to be linked to relative entropy policy search, and there are some strong results — by Gergely Neu, I believe — that something very fundamental lies behind this idea. Okay, with that we're through with policy search, and — whoa — I've taken three quarters of an hour for it, minus the time we were late. Questions, guys?

Okay, then I move on, for the second hour, to imitation learning. I have two imitation learning lectures: one on imitation learning by behavioral cloning, the other on imitation learning by inverse reinforcement learning. It really depends on you giving me signals — whether I'm going too fast, whether I make it through both of them.

Imitation in robotics is super useful: getting a demonstration as a starting point can be quite crucial, and in fact most of the most impressive robot learning results have been achieved by imitation learning. There are two types. One type dates back a really long time — you could say to Michie and Chambers — and that is the idea of behavioral cloning. So what is behavioral cloning? You would be like Garfield here: you show a behavior to your robot, in this case a very human robot, and afterwards it reuses this behavior quite directly. In the human sciences this dates back to the 1800s; Thorndike called it "learning to do an act from seeing it done." The first really impressive result was due to Dean Pomerleau, who built ALVINN, which reproduced steering actions for a given retinal image.

Among the things you've seen so far is how to get from experience data to an optimal policy — that covers basically all the different types of reinforcement learning. Now we are trying to get from demonstration data to an optimal policy right away, and this requires that we look at two different things: first, how can we teach a robot without too much programming, and second, how can we get a policy representation which is useful for robotics — there I will tell you a little bit about motor primitives.

Let's start with learning policies from demonstrations by supervised learning. This is super successful for humans, for the simple reason that if you had to learn everything by pure self-improvement, you would still be in primary school: with self-improvement alone we could not explore the world sufficiently, because the search space, in robotics just as for humans, is usually far too large. On the other hand, in robotics you usually have an expert who knows what a good policy is — think about the neurosurgeon.
If you ever have a prostate operation in your future as a man, most likely it is not going to be executed directly by a human surgeon's hands anymore; it is executed by a human joysticking a robot. And quite impressively, the human is joysticking the robot without haptic feedback: just from seeing some pictures of a very tiny part of your body from the inside, the human has to figure out where to cut and how not to cut the wrong thing, and save you, let's say, from cancer.

It turns out that nearly all interesting animals can do some form of imitation learning. Yesterday I learned that even bees do imitation learning, which I hadn't known. For rats this is really well studied: when there is a companion rat, a rat will imitate it, and you can block this and subsequently see how well it does imitation. For dolphins it goes even further: they are directly wired to imitate even facial expressions of humans, despite us having a completely different body. And that's also the case for human infants, which even at the age of forty-two minutes can copy facial expressions — they don't have to learn it first. So if you think of us humans as a robot, then the best comparison you could think of is that imitation is kind of like the BIOS in a computer, which has a bootloader to load the operating system; that is the way you should think about imitation.

We do this in a lot of different ways in robotics. We teleoperate robots with a joystick; we do kinesthetic teach-in, where we take the robot by the arm just like a tennis teacher; we use sensor suits like this one here, where we plug a human into a suit so that we can give the robot trajectories from that; and then we can do marker-based tracking or, in the best case, complete computer vision, reconstructing from a plain video what the human was doing and transferring that onto the robot. The last part is the holy grail, but I don't know of any really good results that have worked with it.

Yes — can newborns even see? They can see, just not very well. You can do one experiment which most of the childcare books propose: draw smileys and see from which range they react to them; they react to that very quickly — they basically stop crying. That's why I said forty-two minutes. I didn't try it with my twins until they were a couple of days old, and it only worked for one of them — but you can actually show that if they want to (they already have a little bit of their own will at that point, not a lot), they will imitate your facial expressions. Seeing is near-instant. That's actually why the computer vision community for a very long time said: oh, we just have to find the right priors and then vision will be a totally easy problem. In a way, what the deep learning people did is they found the right priors — through deep learning and big data sets — at least for some of the vision problems. So we clearly don't learn vision from scratch.
That's one you can take for granted. Other questions? Okay. Now let's say we have chosen one of these modes of operation. The idea of behavioral cloning is surprisingly simple: we have a trace of the expert's actions — it could be positions and velocities of the joint angles — and now the student has to infer a policy. In the old days that policy used to be a deterministic function; these days we usually think of it more as a probability distribution, in order not to limit the actions of our robot so much, especially as we would like to use this probability distribution later for reinforcement learning. In principle, in behavioral cloning we can treat this as a supervised learning problem, and if you treat it that way you can extract the policy by assuming some features and some parameters — and you just have a regression problem (I'll show you a tiny sketch of this in a second).

This sounds totally crazy, right — why would this work? Surprisingly, in most cases it works surprisingly well, and part of the reason is that your regression problem actually cleans up the human data: you get a smoother and better solution than what was demonstrated, because we humans have a lot of jitter, and the learning system cleans this up thanks to regularization, quite frequently giving us something better than the original.

We can represent this problem in two ways: in the state-space formulation or in a trajectory-based formulation, and we usually use parametric policies — a conditional distribution of actions given state and parameters. We can do this quite nicely for continuous actions: we demonstrate, do the imitation learning, and if we want to improve, we add reinforcement learning afterwards. There are quite a few properties we want. We want a representation with a low number of parameters, so that we can do reinforcement learning with it later. We want the variability of the humans — humans never do exactly the same thing twice, and you can show that the variability we have in our movements is highly functional. If you give people a task like dart throwing, nearly everything in their movement can be modeled extremely well by a relatively wide Gaussian distribution over trajectories — except for that one moment when you are on the release manifold: for a successful dart throw you have to pass through a very, very small region in the right direction, and exactly around that area the trajectory distribution collapses, so you are suddenly pinned down with very little variance. So on the one hand, yes, we do want to capture all that stochasticity and keep it, but only where it doesn't harm the behavior — and sometimes we can even decode from this distribution what the reward function of the human was. We also want to be scalable to many degrees of freedom, to be modular and composable, and to do all this for both rhythmic and stroke-based movements. As I said, the stochasticity gives us the variability and the exploration of the human.
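In its simplest form, that regression view of behavioral cloning is just a few lines. Here is a minimal sketch with made-up feature functions and a synthetic "demonstration" standing in for recorded joint trajectories:

```python
import numpy as np

rng = np.random.default_rng(6)

def features(s):
    """Hypothetical state features; in practice these are a design choice."""
    return np.stack([np.ones_like(s), s, np.sin(s), np.cos(s)], axis=1)

# Fake kinesthetic demonstration: noisy expert actions u* = sin(s) + 0.5 s.
states = rng.uniform(-2.0, 2.0, size=300)
actions = np.sin(states) + 0.5 * states + 0.05 * rng.normal(size=300)

# Behavioral cloning = (ridge-)regularized least squares on the demo data.
Phi = features(states)
lam = 1e-3
theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]),
                        Phi.T @ actions)

def policy(s):
    """Cloned deterministic policy; regularization smooths the human jitter."""
    return features(np.atleast_1d(s)) @ theta

print(policy(1.0), np.sin(1.0) + 0.5)   # cloned action vs. the expert's rule
```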
Think of the tremor you have in your arm: hold your hand in front of you and you'll notice your finger has a tiny tremor. This is actually functional. It helps you when you're interacting with an object and making or breaking contact, for example; and the variability helps you when you're steering a car — if it's a new car and you need to figure out how fast it reacts to your actions. If you had to perform deterministic exploratory movements instead, you would have to do them consciously; with the stochasticity you get them for free.

For that reason we try to learn trajectory distributions, in most cases with the different output dimensions coupled. You could learn these with whatever supervised learning method you like. The first people who did this used a linear representation with linear features and, already in the 1960s, applied supervised learning to learn the task of pole balancing by imitation — really just by linear regression. For the time, that is of course a pretty remarkable result, because I'm fairly sure they got the input data by typing on a keyboard or something like that. What is even more impressive are the results these kinds of methods achieved in the 1990s in a flight simulator, where you can already see the clean-up effect: the learned autopilot flies a much nicer trajectory into the landing than the original trainer did, simply because it is not as noisy.

You can also do this with nonlinear representations — with RBFs, neural networks, mixture models, you name it — and the most impressive result of that kind came in the mid-1990s with NAVLAB. You have to imagine that the mid-1990s were a time when we thought autonomous driving was nowhere in the foreseeable future; people had tried to hand-engineer it for a very long time, and it was really imitation learning, plus a few other methods, which showed that this was the wrong assumption. Today we're at the opposite extreme: they promise us autonomous cars in the next two to five years. Don't buy it. I think this could even be one of the causes of the next neural winter — Silicon Valley CEOs have completely bought the idea that autonomous driving will happen in two years, and they will be disappointed for sure; that much pretty much all the autonomous driving people agree on.

But back then, an autonomous car required a truck full of computers and sensors of this size — this here was a lidar, a laser range finder. They used this to show that you could learn to drive all the way from CMU to San Diego with no hands on the wheel, on the "No Hands Across America" tour. They took camera images, and all they predicted from them was the steering wheel, the brakes, and the gas, with a two-layer neural network working on a 30-by-32-pixel retina — and you would get these kinds of signals out of it. Now, luckily, this of course only works because they were doing it in America, right?
Try to do this on an Italian road — I'm pretty sure it would have failed right away. And I don't even want to think about what it would have done on an Indian road, where even more people than in Italy run across the street at arbitrary moments. So in the end it's just nonlinear regression — and I hope this video works; I tried it this morning, so blame the computer if it doesn't. I really hate this about my new computer. You see: no hands, and you notice it is going rather slowly, but nevertheless completely autonomous, the humans just sitting back and filming. Here you're on the road — you notice this is still around Pittsburgh, which is rainy and cold, kind of like Germany, and this here is already a much nicer area. It has some obstacle recognition, but it is going at a speed which — even when you look at how slowly those camera pictures come in — makes you recognize that this was mastery of the sensing and actuation technology of the day just as much as it was mastery of the learning technology.

Now, you should have some doubts about behavioral cloning from state-action pairs. First of all, I can probably always make it break down and produce catastrophic failures: if I bring you to regions of the state-action space where you haven't been, or make small changes to the system, your policy can become unstable. In the end, as already the students of Michie noticed, a single human trajectory nearly always worked best — so it doesn't actually work very well across different teachers — and there is no guarantee that the reproduction is meaningful. Even worse, we're doing supervised learning on data which we pretend is i.i.d. but which actually isn't: it is correlated along trajectories, and that doesn't necessarily give us really good long-term behavior. It is basically only when our actions are surprisingly uncorrelated — effectively drawn from the same distribution — that the individual movements become easy to learn.

This has led us in imitation learning to develop time-dependent representations, where we really have time in the policy as well, so that we can follow longer-term trajectories by having internal state variables. One way of doing this is to learn what people in control call gain scheduling (see the little sketch below): you have a weighting function which schedules both gains and offsets over time. This is already a much harder learning problem, but at the same time it allows us to learn, for example, a variable-stiffness controller, which is a really important thing. You want a controller that, when I reach down here, is very precise for a long time, and then, the moment I make contact, is as soft and squishy as possible — and for that I need a very different gain than just a moment before, when I wanted to be as stiff as possible so that I could be precise. We can learn this with a variety of representations: people originally started out with spline-based representations to pin down the trajectory and the controller; for a long time RBFs in time, or in time and state, were very popular; and I think by now most people who do imitation learning in some form encode the behavior in a dynamical system. A dynamical system you can always think of as a very specific recurrent neural network.
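To make the gain-scheduling idea above concrete, here is a minimal sketch of a time-indexed variable-gain controller; the normalized basis functions over a phase variable and all the gain values are illustrative assumptions of mine, not a specific published controller:

```python
import numpy as np

# Phase-dependent basis functions (normalized Gaussians over phase in [0, 1]).
centers = np.linspace(0.0, 1.0, 5)

def basis(phase):
    b = np.exp(-0.5 * ((phase - centers) / 0.1) ** 2)
    return b / b.sum()

# Learned (here: hand-set) schedules for desired position, gain and offset.
W_des    = np.array([0.0, 0.3, 0.8, 1.0, 1.0])   # desired joint position
W_gain   = np.array([50., 50., 30., 5.0, 5.0])   # stiff early, soft near contact
W_offset = np.array([0.0, 0.0, 0.5, 1.0, 1.0])   # feedforward term

def control(phase, x):
    """u = k(phase) * (x_des(phase) - x) + offset(phase)."""
    b = basis(phase)
    return (W_gain @ b) * ((W_des @ b) - x) + W_offset @ b

print(control(0.1, 0.0))   # early in the movement: stiff corrective action
print(control(0.9, 1.0))   # near contact: low gain, mostly feedforward
```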
Let's go through one of these strategies in detail — the one Stefan Schaal started calling dynamic movement primitives — and let me show you a few of the results. Basically, what you are learning is a trajectory generator which can react — or, if you don't want it to, not react — to sensory input. Imitation learning with trajectory generators requires that you learn long-term behavior, but since you just want to follow a trajectory, you obviously still need something which maps desired trajectories to torques — and in robotics that turns out to be a rather simple problem.

The easiest dynamical system which could encode a behavior is a first-order differential equation that pushes you to a goal and stops there. You could deform that: you could make it a second-order differential equation, which can do more — overshoot and come back, for example — and you could go further to something with attractors and repellers, which can even get unstable, and if you wanted to you could encode things like limit cycles or chaos that way. A surprising result from neuroscience is that humans, in the end, appear to have something like a dynamical-systems representation of their movements, and, even more interestingly, they appear to have it either in a discrete form — a point attractor — or in a rhythmic form — a limit-cycle attractor — stored at different locations in the brain. That was a result of the early 2000s, and it gave us the big insight that we only need to encode two types of dynamical systems, and into these systems we can hard-code all the abilities we need for learning movements on anthropomorphic robots: stability, robustness to perturbations, point-to-point versus periodic behavior. We can put in more complex shapes, we can learn them quite fast, as you will see, we can couple many degrees of freedom, and we get things like rescaling, retiming, and generalization to external variables.

All we do is start with the initial assumption of a damped spring which pulls you to a goal, and then add a forcing function to the spring so that this forcing function can encode more complex profiles — like this one here which, instead of going directly to the goal during a tennis movement, first pulls back and then passes through that particular point. What is particularly nice is that such a dynamical-systems representation only depends on a phase variable, so you can rescale time through the time constant — higher tau, higher speed — while the other parameters let you rescale amplitudes. The forcing function itself we represent by a weighted sum of basis functions, for example written in matrix form, and by construction it is stable and even guaranteed to converge: in the long run, once the forcing term dies out, the system is just a PD controller toward the goal. Integrating the system gives you a trajectory, and when you perturb it, it moves to a different part of that trajectory.
Here, for example, we have a periodic movement driven by a latent phase variable z; with just such a dynamical system plus some basis functions we can learn the complete movement from data. We can change the goal of the movement by moving the goal parameter g, we can change the temporal scaling, and we can change the amplitude. In the end, learning trajectories this way is quite simple: you take a desired trajectory and its derivatives, you obtain from it the final position and use it as the goal, you extract the amplitude and timing parameters (or set them to normalized values so you can modify them later for movement composition), then you compute all the target values for the forcing function and just do linear regression, as before (see the small sketch below).

You can do surprisingly cool things with this. This here is hitting a virtual tennis ball, and this here is the humanoid robot DB hitting a real table tennis ball on a stick. We used this for imitation learning of rhythmic behavior, where we give the system the position of the ball to couple to — so we feed more state variables into the imitation learning approach; there is a ball on a string — and learning the right parameters from the human movement worked pretty much out of the box. And again, as I told you on Monday, hand-engineering this behavior with the best possible control engineering methods did not hit the ball more than two or three times, even after six months.

Finally, you can do things like what Jun Morimoto did in Japan: he learned the coupling between the floor and the robot's gait. He actually got his imitation data in a very peculiar way: he took human data from a book, extracted the trajectories by hand — together with the coupling to the floor — from a Japanese textbook on human movement, and he got a pretty good policy, one which allowed this robot to walk.

So these are the desiderata for movement primitives, and this approach gives us the big advantages: they should be data-driven, so that we can easily learn from demonstrations; we want to generalize; we want to be able to combine primitives by activating them together, rescale them by changing the timing, couple them, and represent the coupling between the degrees of freedom; we want the variability of the teacher and ideally even the optimality of the teacher; and we want to do this for both rhythmic movements and discrete strokes.
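A minimal sketch of a discrete dynamic movement primitive learned from one demonstration — the exact gains, the basis functions, and the synthetic "demonstration" are my own illustrative choices, not the original implementation:

```python
import numpy as np

# One synthetic demonstration (stand-in for a kinesthetic teach-in), dt = 0.01.
dt, T = 0.01, 1.0
t = np.arange(0.0, T, dt)
y_demo = np.sin(np.pi * t) * 0.5 + t          # demonstrated joint trajectory
yd_demo = np.gradient(y_demo, dt)
ydd_demo = np.gradient(yd_demo, dt)
y0, g = y_demo[0], y_demo[-1]                 # start and goal from the demo

# Transformation system: tau^2 * ydd = az*(bz*(g - y) - tau*yd) + f(x)
az, bz, tau = 25.0, 25.0 / 4.0, T
# Canonical system: tau * xd = -ax * x  ->  phase x decays from 1 toward 0.
ax = 4.0
x = np.exp(-ax * t / tau)

# Forcing-function targets from the demonstration, then linear regression.
f_target = tau ** 2 * ydd_demo - az * (bz * (g - y_demo) - tau * yd_demo)
centers = np.exp(-ax * np.linspace(0, T, 10) / tau)
widths = 1.0 / (np.diff(centers, append=centers[-1] / 2) ** 2 + 1e-6)
Psi = np.exp(-widths * (x[:, None] - centers) ** 2)        # (time, basis)
Phi = Psi / Psi.sum(axis=1, keepdims=True) * x[:, None]    # normalized * phase
w = np.linalg.lstsq(Phi, f_target, rcond=None)[0]

# Roll the learned primitive out again along the same phase.
y, yd = y0, 0.0
for phi in Phi:
    f = phi @ w
    ydd = (az * (bz * (g - y) - tau * yd) + f) / tau ** 2
    yd += ydd * dt
    y += yd * dt
print(y, g)   # the rollout ends near the demonstrated goal
```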
Now, with the deterministic dynamical-systems motor primitives we only got some of these desiderata. But the moment you move from a deterministic function to trajectory distributions — basically the same step as when we moved from linear regression to Bayesian linear regression, or from kernel ridge regression to the Gaussian process — you get a trajectory distribution, and also a generator over such trajectory distributions given some additional input. We do this by representing a single trajectory with phase-dependent basis functions, putting a probabilistic model on the weights, and then integrating out the parameters, just as we would in Bayesian linear regression. And we can do this with our old friend the Gaussian, which gives us the mean and the variance of the teacher's distribution — obviously for multiple degrees of freedom at once — and with that we have our trajectory generator.

I want to go through this quickly so that I hopefully still have time to tell you a little bit about inverse reinforcement learning — I will not tell you all of it, obviously; that's why I'm hurrying now. You can encode the complete trajectory distribution here, including one generated by optimal control. You can quite nicely condition on different goals, and — even nicer — if you have two different demonstrations, the blue one and the red one, you can combine them into a new path distribution, and in that way combine, completely modularly, two totally different behaviors: one which goes through these two via-points and one which goes through this via-point, leaving all the behavior in between open for further learning. With that you can learn things like this maracas task, which is basically a shaking task, and blend between different styles — even go back and forth between the styles.
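A minimal sketch of that probabilistic trajectory representation — a Gaussian over basis-function weights fitted to a handful of demonstrations and then conditioned on a via-point. It follows the general recipe described here, but the basis functions, noise levels, and fake demonstrations are assumptions of mine:

```python
import numpy as np

rng = np.random.default_rng(7)
t = np.linspace(0.0, 1.0, 50)

# Phase-dependent basis functions.
centers = np.linspace(0, 1, 8)
Phi = np.exp(-0.5 * ((t[:, None] - centers) / 0.08) ** 2)   # (50, 8)

# Fake demonstrations: noisy variations of a reaching movement.
demos = [np.sin(np.pi * t) + 0.05 * rng.normal(size=50) for _ in range(12)]

# Fit a weight vector per demonstration, then a Gaussian over the weights.
W = np.array([np.linalg.lstsq(Phi, y, rcond=None)[0] for y in demos])
mu_w, Sigma_w = W.mean(axis=0), np.cov(W.T) + 1e-6 * np.eye(8)

# Condition the trajectory distribution on a via-point y* at phase t*.
t_star, y_star, sigma_y = 0.7, 0.3, 1e-4
phi = np.exp(-0.5 * ((t_star - centers) / 0.08) ** 2)        # (8,)
k = Sigma_w @ phi / (phi @ Sigma_w @ phi + sigma_y)           # Kalman-like gain
mu_cond = mu_w + k * (y_star - phi @ mu_w)
Sigma_cond = Sigma_w - np.outer(k, phi @ Sigma_w)

# Mean near t* = 0.7 before vs. after conditioning (pulled toward the via-point).
print(Phi[35] @ mu_w, Phi[35] @ mu_cond)
```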
Here we first used this reparameterization, by conditioning, to get different angles or different distances the puck traveled, and when you combine these two you can basically reach any location you want. Similarly, by conditioning we could do selection. So that means we have actually addressed all of these core desiderata for a movement primitive representation.

Obviously, one primitive is not enough. Instead you really need an architecture like this one here, where we have many primitives which all create motor output depending on visual input, which for us usually means some form of information on the object and on the teacher. We applied this in robot table tennis, where we have this scenario: a ball launcher, different cameras, a teacher, a ball, a Barrett WAM, and a table tennis table — even I'm in the picture, surprisingly. Now we have many different primitives — you can learn forehands, smashes, backhands — and we have a selection mechanism, let's call it a gating network, as in a mixture of experts, which selects based on the incoming ball, the robot's own position, the opponent's movements, and prior opponent play. And — oops, here you see the wrong one — here you see the resulting robot table tennis playing behavior, in this case only forehands, although it can also do backhands, and these are different forehands by the way; it has 25 forehands, I think. So you can learn table tennis quite well this way.

So there are four core questions in imitation learning which we need to answer, and what I've shown you today is actually a tiny excerpt. In the end, what to imitate — at what level of abstraction — is a really hard, somewhat even ill-defined question. We have the obvious problems of how to deal with outliers, redundant data, or data that's irrelevant to the task, and maybe even figuring out the relevant components. But at some point we actually need to figure out the level of abstraction at which we want to imitate things: getting from A to B could be done by very different means, and maybe we don't need to imitate walking when we could sit in a car and drive, if the imitation is just about getting from A to B.

Then the how to imitate is, even for this task, a really difficult question. Personally, I like to avoid it by doing kinesthetic teaching — after all, humans do it in tennis too. But practically speaking, the body of the teacher is never the same as the body of the student. We already noticed this in table tennis: when we take the robot by the hand and show it accelerations, it can never reproduce the accelerations of the human, so it needs reinforcement learning for any of the faster, more interesting movements in order to relearn them so that it can actually accomplish them. The body of the teacher is really never the body of the student, and so instead we have this correspondence problem.

Then, when to imitate — I don't think anybody has even started answering this question. If you really want to build a robot that imitates, you would rather want to first sit it in a corner; it should watch you and then decide, hey, I want to imitate only this particular segment. I have no clue how you would actually answer this question. And then it gets much worse when you recognize
that, oh, there are going to be multiple people in that scene, and then the whom to imitate — well, that makes it just crazy. But I think these are the questions we need to answer in imitation learning in the future.

Now I have 20 minutes. In these 20 minutes I would still love to — and I'm sorry for being so much slower than planned — get you to understand at least a little bit of the last topic. Oops, this is the wrong one; this is the one: inverse reinforcement learning, okay. I would like you to understand, at least on a basic level, what's happening in this last type of robot learning. I mean, you have learned about model learning, you've learned about three ways of doing reinforcement learning, and you have learned about one way of doing imitation learning. And when you look at it, behavioral cloning can bring you very far, but it also requires lots of demonstrations, and the moment you want to capture intentions or goals, you need inverse reinforcement learning, which is also known as inverse optimal control, inverse optimal planning, and so on. What it does is determine the cost function under which the teacher's behavior is optimal, and the basic assumption behind it is that the reward function is a more concise description of the behavior than the actual behavior itself would be.

So, what have we done so far? We've solved four problems; one more to go, so let's get through it. I'm not going to do all of it today: I first want to do a comparison to behavior cloning, then I will very quickly go through these three categories of methods, and I will probably skip some of the applications and give you a conclusion.

Now, in behavior cloning you would obviously look at the same problem again: we have some state and some learning algorithm which is supposed to produce the actions — like this rover here, which belongs to CMU and is supposed to get over here, and there is brush here, rocks here, grass here, and a tree here. Obviously you want to avoid the tree, you want to avoid the rocks, and only as a last resort do you want to go through the brush. You really want to find what is invariant across these different features as they are observed here.

Surprisingly, the people who do inverse reinforcement learning actually focus on one of the biggest successes of behavior cloning in order to take it apart, since they look at the old NIPS paper on ALVINN, where they found this wonderful quote: if the neural network is not presented with sufficient variability in the training to cover the conditions it is likely to encounter when it takes over driving from the human operator, it will not develop a sufficiently robust representation and will perform poorly; in addition, the network must not solely be shown examples of accurate driving, but must also learn how to recover, i.e., return to the road center, once a mistake has been made. When you read this, it's as if the scales fall from your eyes, since you directly recognize: oh, that was actually a pretty dangerous thing which they did.
They just drove so much that they hoped their state–action distribution would actually resemble a realistic state–action distribution. In reality, though, if they had needed a recovery behavior for something rare — like a little kid running in front of the car — it probably wouldn't have worked that well, since they would have had this recovery behavior exactly once in the training data, and if the situation had occurred again at test time, most likely the cleanup effect, which was so useful for making trajectories nicer, would have smoothed it away as well. And even worse, you don't want to have to show the robot 15 or 20 recoveries from the same situation, or even a million recoveries from the same situation. So you really want to give these demonstrations more weight: you need the right variability in your demonstrations, you need lots of them, and you want to recover from your mistakes.

That is what was introduced in the context of imitation learning, which these days is set equal to apprenticeship learning — despite the fact that what Peter Abbeel actually meant by apprenticeship learning was the combination of imitation learning through inverse reinforcement learning followed by additional reinforcement learning. The important proposition is that the reward function provides a more succinct and transferable definition of the task than what the policy can tell you in terms of behavior.

In many domains this has become super powerful. For example, you can do modeling of agents and answer scientific questions like: how do bees forage? How do songbirds vocalize? What cost functions underlie these kinds of animals, or humans? For human arm movement there's a long-standing debate over whether it is minimum jerk, minimum torque change, or minimum endpoint variance which generates the movement, while for locomotion it seems pretty clear that it is minimum metabolic energy. Finding these cost functions is exactly the job of inverse reinforcement learning.

We'll discuss this a bit within this Crusher robot scenario that we have seen before in the introduction. Inverse reinforcement learning would want to learn the cost map for a planner, or for any other form of control policy generation, such that you do not directly learn the actions. And you directly recognize: well, you have some features and some parameters, and each of these would be a feature.
These here are obviously features to avoid, so you would place a big negative reward on them — equivalently a big positive cost — and from that you could obtain a good plan. The parameters would be, for example, a high cost associated with brush and a low cost associated with grass. You can collect paths by teleoperation, by taking this gigantic vehicle and wearing goggles so that you see what the vehicle sees, and teaching it how to move. Subsequently you may have the task of getting from here to here — this is obviously a helicopter view — and if your training tells you to stay on the road, it will do exactly this kind of planning: stay on the road, drive from A to B. It also works for other paths. Here you see an example of the underlying cost function — this is a bit to the left and to the right — where we are basically choosing the minimum cost, and you would really be in trouble if you were here or here. This cost comes from the satellite map, but you could even teach it to avoid the road, and here you see an avoid-the-road scenario where you want to drive through the brush — something you should not accidentally do with your own car; you really should use a Crusher-type vehicle for that. This also works across other kinds of terrain, and here you see that the cost map is also very different. Good, so let's cancel that one.

So how does this work? The basic idea is that we have a latent reward function which describes what we want to accomplish, we have a reinforcement learning or optimal control method which gives us a policy, and we have the dynamics. Now, obviously, given a policy, or behavior traces from a policy, how can we recover the reward function? That's the idea of inverse reinforcement learning. There are three kinds of methods in the literature which dominate: one is the maximum margin approaches, the second is the maximum entropy approaches, and the third are direct parameterizations of the policy.

Let's first look at how behavior cloning worked. In behavior cloning we had traces of the teacher, we wanted its long-term behavior, and we would fix a policy class and estimate a policy — and the long-term behavior would really be the problem. In inverse RL we follow a different path: we again take the traces of the teacher and we want to capture the teacher's long-term behavior, but in this case we assume a transition model and no reward function, and we try to recover the reward function that best explains the policy and the long-term behavior of the teacher. In other words, the big core question is: can we use a candidate reward function to obtain the policy of the teacher, and how do we find it?
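To make the feature-weight picture concrete, here is a small sketch, assuming the cost of a grid cell is simply a weighted sum of terrain features and using a plain Dijkstra search as a stand-in for the actual planner on the vehicle. Feature names and weights are made up for illustration.

```python
import numpy as np
import heapq

def cost_map(features, w):
    """features: dict of name -> (H, W) terrain maps; w: dict of name -> weight.
    The cost of a cell is just the weighted sum of its terrain features."""
    H, W = next(iter(features.values())).shape
    c = np.zeros((H, W))
    for name, fmap in features.items():
        c += w[name] * fmap
    return np.maximum(c, 1e-3)             # keep costs strictly positive for planning

def plan(cost, start, goal):
    """Plain Dijkstra on the 4-connected grid; a stand-in for the real planner."""
    H, W = cost.shape
    dist, prev, pq = {start: 0.0}, {}, [(0.0, start)]
    while pq:
        d, (i, j) = heapq.heappop(pq)
        if (i, j) == goal:
            break
        for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
            if 0 <= ni < H and 0 <= nj < W:
                nd = d + cost[ni, nj]
                if nd < dist.get((ni, nj), np.inf):
                    dist[(ni, nj)] = nd
                    prev[(ni, nj)] = (i, j)
                    heapq.heappush(pq, (nd, (ni, nj)))
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]
```

With weights like `{'rock': 50.0, 'brush': 5.0, 'grass': 1.0}`, "stay on the road" versus "drive through the brush" is just a different weight vector over the same features — and recovering that weight vector from demonstrated paths is exactly the inverse RL problem.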
Let's contrast these one more time. Behavior cloning is simple to implement and has very few assumptions, but it does badly on long-term behavior, generalization is more complicated, and it needs many samples. Inverse reinforcement learning requires that you can solve the reinforcement learning problem involved — which is obviously hard if you are dealing with a high-dimensional robot, for example — but if you can do it, the reward is a very compact description and it is very easy to transfer to new tasks.

Yes? [Question from the audience.] It is an ill-posed problem, totally — give me a moment.

So let's start exactly with that comment: it is ill-posed. What we want is to find a reward function which explains the expert behavior. First of all we make a very limiting assumption: we assume that the expert is optimal with respect to this reward function. Now, you can go to the psychologists and they will tell you humans are never optimal, and you can start a long discussion with them about optimality under what information state — never mind, you can very quickly find a big discussion there. But let's assume the expert is optimal. In this case we want to find a reward function such that, under this reward function, the expert policy pi-star beats, or is at least equal to, all other policies. And immediately when you write this down you recognize: ouch, this is totally ill-posed — I think this is what you meant, right? A reward function of zero would always fulfill this property. Then we have the problem that we do not observe the policy pi itself; we only observe traces. And what do we do if the teacher also makes mistakes? And worst of all, the condition ranges over all policies, so we would have to enumerate all of them.

Let's tackle this first with the maximum margin approach, which made Peter Abbeel famous and got him his job at Berkeley. It uses a feature-based representation of the reward, and when the reward is linear in features and parameters, you can actually pull the parameters of your reward function out of all of the expectations, and you are basically left with a feature average — the feature expectations — and some parameters. When you substitute this into our basic assumption, you have one feature-expectation vector for the optimal policy, a weight vector, and another feature-expectation vector for each of the other policies. This means we have to find a w-star such that the feature expectations of the expert always beat those of the other policies. This directly has two important implications: feature expectations we can estimate without actually having access to the policy, which addresses the limited-data challenge, and the number of expert demonstrations we need scales well with the number of features in the reward function.
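As a small illustration of why feature expectations are the key quantity here: they can be estimated from demonstration traces alone, and with a linear reward the expected return of any policy is just a dot product with them. The discount factor and function names below are assumptions for the sketch.

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma=0.99):
    """Monte-Carlo estimate of mu = E[ sum_t gamma^t * phi(s_t) ] from demonstrations.
    trajectories: list of state sequences; phi: maps a state to a feature vector."""
    mu = None
    for traj in trajectories:
        acc = sum((gamma ** t) * np.asarray(phi(s)) for t, s in enumerate(traj))
        mu = acc if mu is None else mu + acc
    return mu / len(trajectories)

# With a linear reward r(s) = w @ phi(s), the expected return of a policy is just
# w @ mu(policy). Comparing the expert against another policy therefore only needs
# the two feature-expectation vectors, never the policies themselves:
# expert_is_better = w @ mu_expert >= w @ mu_other
```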
This is much better than learning the policy directly: we do not depend on the complexity of the policy or the size of the state space, but really just on there being this concise reward function represented by our features. So we got rid of one of the problems, and now for the next one: we do not want to enumerate all policies, but only compare against a finite number of them. This is actually a big advantage, because you can plug it into the mindset of support vector classification, which I think you have seen in this summer school already, and compute the distances to a hyperplane separating the optimal policy from all of the others in the right feature space. This allows a reformulation which also solves the ill-posedness problem: you add a plus one here, so that your optimal policy should be better by at least one — you could give this margin a scale, but that scale is subsumed by w — and you require this in a minimum-norm sense. From all your knowledge about the support vector machine, you directly see that this is basically the same QP, and you can actually solve this QP if you have been given a limited number of policies.

You can make this slightly better by using a different margin depending on the two policies, so that you incorporate some form of distance between policies — for example, the minimum distances of the generated path from the example path. This takes away another problem; we do not have the ill-posedness problem anymore. You then of course want to encourage high losses whenever things go too wrong — whenever your hyperplane is moved such that some other policy would be classified as a potentially optimal one — and you can introduce suboptimality just the way you do it in the support vector machine, by introducing slack variables. That way you can even deal with learning the solution for multiple Markov decision processes, not just one. Still, there are a lot of challenges when it comes to large problems, and one could do much more there.

Now, because I only have two minutes left from what I understand, I will skip constraint generation, but that is basically one effective way of creating new policy candidates: at the same time as you create the new policies you compare against, you automatically find a better and better reward function.
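A minimal sketch of the resulting QP, in the SVM spirit just described: minimize the norm of w subject to the expert's feature expectations beating every candidate policy's by a margin, with slack variables for a suboptimal expert. This uses cvxpy purely for illustration, and the plain "+1" margin could be replaced by the structured, path-distance-based margin mentioned above.

```python
import numpy as np
import cvxpy as cp

def max_margin_irl(mu_expert, mu_candidates, margins=None, C=10.0):
    """Recover reward weights w such that w @ mu_expert beats every candidate
    policy's w @ mu_i by a margin, allowing slack for suboptimal demonstrations."""
    mu_expert = np.asarray(mu_expert)
    d, n = len(mu_expert), len(mu_candidates)
    if margins is None:
        margins = np.ones(n)                   # plain "+1" margin; could be structured
    w = cp.Variable(d)
    xi = cp.Variable(n, nonneg=True)           # slack: tolerates a suboptimal expert
    constraints = [
        w @ mu_expert >= w @ np.asarray(mu_i) + margins[i] - xi[i]
        for i, mu_i in enumerate(mu_candidates)
    ]
    objective = cp.Minimize(cp.sum_squares(w) + C * cp.sum(xi))
    cp.Problem(objective, constraints).solve()
    return w.value
```

In the constraint-generation loop I just skipped, one would alternate between solving this QP and re-planning with the current w, adding the resulting policy's feature expectations as a new constraint each round.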
Now I want to highlight, in the last minute I have — maybe taking a few extra minutes, if you don't mind — one other approach: the maximum entropy approach, which I personally find much more appealing. There, people follow the premise that you want to be minimally committed, maximally uncertain about your actions, subject to the constraint that your policy agrees with the feature averages you have encountered. Nearly all of the distributions we know — the exponential family distributions, for example — come directly out of such a maximum entropy principle. This really allows a proper treatment of suboptimal experts, while at the same time directly giving us a stochastic policy, which I again find more appealing. So here we have the entropy, and here we have the constraints; if you solve this, you get an exponential family distribution — you get a Gaussian, for instance, in the same way. We can do this over paths, so just like for policy gradients we can work with the distribution of trajectories generated by the teacher's policy, and we have to match the feature averages of the teacher to get the right kind of path distribution. And we directly recognize: wow, this actually works quite straightforwardly — we could even plug this into the probabilistic movement primitives from the behavior cloning lecture and recover the reward of the trajectories that way.

The only problem is that you need to figure out how to bring in the system. If you want to bring in the system dynamics, you need one more step: you need state consistency, as down here, and if you do this you end up with a policy as a function of, basically, the Q-function, where the Lagrange multipliers appear, and together with the normalization this becomes a softmax. And this is a convex problem — you can write it down as an x log x sum — so you can actually solve it quite effectively. I will skip the last family, policies parameterized by rewards, but that is basically the dual of this: you could also just write down the dual right away and do the optimization on it, which gives you a parameterized policy.

This brings me now to the last part: applications. There have been a hell of a lot, as you can see from this list, and — actually, these are fun, these I should show you; hopefully they work. Here you see driving behavior taught by Peter Abbeel, and you notice how in one case it learns very clean driving, while in other cases it learns very sloppy things — sorry, in this case that sloppiness was in the demonstrations. Here you see what the robot has learned, some very sloppy behaviors among them, but overall the simulated robot has been doing quite nicely; still, sometimes it will do something very crazy, but overall this is actually quite impressive. Cancel.

Then you can do parking lot navigation — let me directly move to the video. Here you see first a very good driving style; it has point clouds from an RGB camera, and with this first driving style it will nicely go around and take a very favorable way of getting to this parking slot. I hope this works — okay, beautiful. Moving this presentation to a new computer was a really bad idea. Okay, let's leave these out since they're kind of boring, but you can explain humans quite well by it — that's all they are saying.

And let me finish with the most famous example, and that is the helicopter work of, again, Peter Abbeel and Andrew Ng, who used such a helicopter to actually reach the performance of human acrobatics experts — real experts who compete in competitions. They would collect data from the experts, who use these joysticks to do helicopter acrobatics, and they get IMU data.
So, again, IMUs — you have seen them on Monday — in order to know where the gravity vector is and what the accelerations of this helicopter are, and they have cameras observing where the helicopter is, all of which gets used by the computer. They first run a Kalman filter on that data, and then they learn a feedback controller, and there are four control actions, since there are four things you can do with the joysticks.

And let's hope this — I can't believe this. Okay, let's check this video. Okay, in this case I will directly show it from YouTube; that's probably the smartest thing. This is really something you should have seen at least once when it comes to inverse reinforcement learning; these were the milestones of the field. These behaviors were really, really tough back then — you must imagine this is 2007, 2008. At that time we did not have these fast servo controllers in the helicopters as today, where humans can control them much better; back then you really needed to be really, really good as a human to joystick this behavior. And the learning system which they had actually managed to learn all of these crazy behaviors: there is a loop-the-loop, tic-tocs, pirouettes, again loops, and different turns. This one is called a hurricane — I'm not quite sure why, maybe because it's kind of a spiral. I think you had an inverted helicopter in there, too — yep, this looks pretty inverted — and it gives you a feel for it: you just don't want to sit in that helicopter, huh? So this is even an inverted helicopter hovering, which ten years earlier was considered an unsolvable problem of helicopter acrobatics, because people simply couldn't manage it. Well, let's close that part.

Ah, okay, the video would have been here — fantastic. I screwed up by thinking I had screwed up; I actually had the video coming a little later, and that was just a photo before it. Never mind.

But importantly, in the end you should know why inverse RL is sometimes better than direct imitation learning or behavior cloning; you should know the algorithmic challenges and some of the methods; and you should know what is good about maximum margin and maximum entropy — maximum margin originally got us there, but I think maximum entropy is a much cleaner way of doing things. And yeah, I hope to have given you somewhat of an overview over most of what is interesting within robot learning. I've only left out the problem of map building and state estimation, which some people also count as robot learning. And I've taken ten minutes more than I was supposed to — I hope that was okay.