OK. So welcome back, everyone, to the second session for today. Our next lecture will be given by Gerhard Neumann, who is a professor at KIT and head of the Autonomous Learning Robots lab there. So can we welcome the speaker? Hello, hello. This is much better. Yeah. So this is a very wide field, so I took a few topics out of it that were the most interesting for me, and let's see whether they are also interesting for you. I think it fits very nicely with the talk we have seen previously. But let me start with motivating the topic. Why do we look at robotics? My standard answer is: because it's actually a lot of fun to watch these robots learn. You see different examples here where you can use reinforcement learning in robotics, like robot manipulation with dexterous hands — that's an example from OpenAI — or legged robotics. The things you can learn with reinforcement learning are actually quite nice. But obviously there is also a huge economic and industrial potential in using reinforcement learning for robotics — bin picking, assembly, disassembly, agriculture, service robots. Reinforcement learning is not really used there so far. We are not there yet, but I think it will not take too much longer — maybe the next five years — and we'll see a big impact there. And the third answer is: robotics is a very, very challenging problem, and it unifies almost all the big challenges in reinforcement learning. So it's a very good testbed for reinforcement learning algorithms.
So what are the challenges? In red you have the challenges, and in green some of the solution concepts. I won't have time to cover all of them, but we'll go into a few. So, what are the challenges? Data, obviously, is very, very costly. On the real robot we just cannot run millions of trials; that's not possible, and the robot will break. Then you have very high-dimensional observations — usually you want to learn directly from cameras — high-dimensional states, high-dimensional actions, and so on. Then partial observability: you cannot observe the real state, you can only get at it through sensors like a camera, which doesn't tell you everything — occlusions, for example, you have to deal with. Then exploration, and finding the right policy representation, is another very hard challenge. On the robot you'd better not explore the standard reinforcement learning way, which is more like a random walk — this will break your robot; it's very dangerous. And also, how to define rewards is a huge challenge. For complex manipulation tasks, for example, it's unclear how to write the reward down, and there are different methods for that as well.
So in this lecture, I want to go into three of these challenges and offer you some solution concepts. The first one is how to tackle sample complexity and continuous actions in deep reinforcement learning. Here we'll look at off-policy actor-critic algorithms and maximum entropy reinforcement learning, which is highly connected to what we have seen previously. Then I will talk about robust policy optimization — optimization methods that you need for the complex, high-dimensional action spaces that typically occur in robotics. Here I will introduce a new algorithm, similar to PPO and TRPO, called differentiable trust region layers, which has some beneficial properties. And in the end, I want to talk about policy representations and how you can use them to get better exploration properties.
And here, I want to talk about motion primitives and also black-box optimization as a variant of deep reinforcement learning. OK. And yeah, if you have questions during my talk, feel free to raise your hand and we can directly discuss them. OK, but before I jump into the topic, I want to tell you a little bit about the systems we want to learn on. I think there are two groups in the community. One group mainly does things in simulation and tries to transfer the learned policies to the real robot. And here we now have the advantage that many of the simulators are getting more and more powerful. Right now, we have these three different simulators that I show here: for example MuJoCo, SAPIEN, and Isaac, or Isaac Sim. And obviously, with a simulator it's much easier to work. You can use many, many more samples because you can massively parallelize it. So you can use methods that are actually data-inefficient but computationally fast, and this is also an advantage of certain methods here. But in the end, there will always be a sim-to-real gap. The simulation, for example, can be used for efficient pre-training, and then you only have to fine-tune on the real robot. And what you can already simulate is complex robot contact dynamics — for example here, a box-pushing task: how to push a box to a desired location. Here, the SAPIEN simulator can basically simulate various manipulation tasks at once. It's very fast, can simulate on the GPU. And here, it looks like a real robot, but it's a simulated robot, where you can even simulate how to screw a nut onto a bolt, including the deformations of the nut and things like that.
On the other hand, many people also try to learn directly on the robot. That's much harder. Here you really have to use algorithms that are much more data-efficient. Usually these algorithms are computationally very slow, so that's a disadvantage, and they also usually have a higher bias — we'll come to that. Higher bias means they give you a policy faster, with fewer samples, but the quality of the policy is usually not as good in comparison to the more data-inefficient algorithms. Things people have done here, learning directly on the robot, include contact-rich manipulation, like the peg-in-hole insertion task that you can see here, or a bin-picking task from Google. On the real robot, obviously, it's much harder to parallelize. You can only do that if you're Google and have seven or twenty of these robots; usually you cannot afford that. OK. So that's basically the bigger picture of these two different families in the research community, and we will have a look at what kind of algorithms we can use for both. The presentation always gets stuck after these videos here, so let me go to the next slide. Yeah, OK.
And for the types of algorithms we can use, I actually like this categorization quite a lot — it's stolen from Sergey Levine's lecture, but I want to show it here. You have a very coarse categorization of reinforcement learning algorithms, and you can more or less divide them into on-policy and off-policy algorithms. The on-policy algorithms, usually policy gradient methods like PPO and TRPO, have a low bias, so to say, because they directly optimize what you want to optimize: the return. At least in the plain version, there is no approximation in it. You don't rely on the Q-function — the Q-function could be wrong; it will be wrong.
So here, no — you directly optimize the return, and that usually leads to better policies in the end. They are also computationally very fast, typically ten times faster than off-policy algorithms, if not more. But they cannot really reuse data — that's why it's called on-policy. They can only use data from the current policy; all the old data you have generated, you need to throw away. I think that has been covered, to some extent, in the policy gradient lecture. So direct learning on the robot is infeasible, but for learning in a simulator it's actually quite nice, because the algorithm is computationally very fast. If you can parallelize your simulator, then you should use this kind of algorithm. On the other hand, there are the off-policy methods, which are basically in the field of approximate dynamic programming, which we just heard about in the previous talk. They can reuse all experienced transitions from before: you use a replay buffer, and you can basically learn your Q-function. That's better suited for real robot tasks, because it's much more data-efficient. But they have a higher bias. With higher bias I mean the gradient we estimate for the policy is biased — it's wrong, because of our approximation of the Q-function. The Q-function will never be really accurate.
And we have heard a lot about approximate dynamic programming — DQN, basically, for discrete actions. Continuous actions were covered a little bit in the previous talk. But what I want to show you now is how you would really implement that for continuous actions. These are three quite famous algorithms: DDPG (deep deterministic policy gradients), TD3, and SAC. We'll quickly go into them in the first part of my talk. So these algorithms are basically like Q-learning: to do off-policy learning, we can reuse all the data we have ever experienced to learn the Q-function. But they don't use the max operator. They also use a policy representation, and use the Q-function to update this policy representation. Typically, when we talk about continuous actions, this is a Gaussian policy, because this is the only distribution we can efficiently sample from in a continuous space. So what is the bottleneck? It's the optimization over actions in the Q-function — we have to find the maximum over the actions. And off-policy actor-critic methods basically do two steps. First, they have to update the critic for the current policy — the difference to DQN, again, is that you don't do it for the optimal policy but for the current policy. And then you have to update the actor. There are different variants for doing that: the stochastic actor update, which would use a traditional policy gradient; the deterministic actor updates, which are the DDPG and TD3 algorithms; and what I call a variational actor update, because it's inspired by variational inference, which is the soft actor-critic algorithm. We'll quickly go through all three of them. But for the critic, it's more or less the same as what we heard in the previous talk. This is basically the DQN learning rule. The only difference from standard DQN is that usually we have the max operator here, but now we have a policy — a Gaussian policy — and we want to compute the expectation over the Q-value in the next state. We can approximate this expectation by sampling: we just sample from our policy in the next state and compute the average. And if you're lazy — we typically are lazy — we set the number of samples to one, and we can just do that.
For the actor update, what you have learned already in the policy gradient lecture would be this update here. You have the policy gradient objective — you want to optimize the expectation, with respect to the policy, of the advantage function — and then you can use the policy gradient for that, the likelihood ratio. You can do this, but it's rather inefficient; it has high variance. Because you only use samples of your actions, you don't use the knowledge about the Q-function. And the big difference to standard on-policy gradient methods like PPO is that we know the Q-function — if we've learned it, we know the approximate Q-function. So a better thing to do here is to use the knowledge about the Q-function, the derivative of the Q-function, to also get the gradient of the policy. And we'll look at how this works.
The first algorithm that introduced that was the deep deterministic policy gradient algorithm. It's still used quite a bit, but I think the SAC algorithm has mostly superseded it. The issue, as I've already said, with the standard policy gradient, if you would use it here, is that you don't use all the information you have. You know the gradients of the Q-function with respect to the actions as well — it's a neural network, so why not use this information? For deterministic policies, this is the DDPG algorithm; for stochastic policies, this is SAC, the soft actor-critic algorithm. And the idea of DDPG is actually very simple. We want to have a deterministic policy — in the end, we will use a stochastic policy by just adding some noise on top of it, but the noise is fixed — that approximates this argmax over the Q-function. And because this is a deterministic policy, we don't have to write it as an expectation over the actions; we can directly plug the action in. So our objective changes in the sense that the Q-function is evaluated at the action that my policy specifies. And now I can compute the gradient with respect to the parameters of the policy, theta in this case. This is a plain application of the chain rule: I get the gradient of the Q-function with respect to the actions times the gradient of the policy with respect to the parameters. In this way, we reuse the gradient information of Q with respect to the actions, and this gradient usually has a much lower variance — it's much more accurate than the standard policy gradient would be.
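A minimal sketch of this chain-rule actor update, assuming a deterministic policy network and a differentiable critic; autograd applies dQ/da times da/dtheta for us:

```python
def ddpg_actor_loss(policy_net, q_net, states):
    # Deterministic policy: a = pi_theta(s); plug the action directly into Q
    # and ascend Q by minimizing its negative. The gradient flows through the
    # critic's action input into the policy parameters (the chain rule).
    actions = policy_net(states)
    return -q_net(states, actions).mean()
```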
The next extension of this idea was the twin-delayed DDPG algorithm, which is called TD3. And twin-delayed DDPG is, again, connected to the overestimation bias. I think there were one or two slides in the DQN lecture about the overestimation bias, and there's the double DQN algorithm, which addresses it. The overestimation bias basically says that the Q-function we estimate with approximate dynamic programming is typically overestimated. The reason is the max operator: we take the max over the Q-values in the next state, at the next time step, and because the Q-values are not accurate — they are noisy — we overestimate. And this gets propagated back through the state space, and we get quite a high error.
In the double Q-learning algorithm, there was a very easy fix: we use the target network for computing the Q-value of the policy, but for computing the best action we use the real Q-network, not the target network. Here, now, we don't have a max operator anymore, because in our critic update we just have this expectation over the actions, right? But still, the Q-function has been used to optimize the policy. So if the Q-function is wrong, the policy will be wrong. You still have this correlation between the error in the Q-function and the error in the policy, and that is exactly what causes this overestimation problem. What you can see here in this plot is that if you now evaluate, for example, DDPG and ask it for the expected Q-value, you get this estimate; if you compute the real expected Q-values, or returns, they are much lower. So this is exactly the overestimation problem.
The twin-delayed DDPG algorithm tries to solve that in a quite simple way. They say: let's just use two different Q-functions. We learn two Q-functions, and we also learn two different policies — that would be the main idea. For updating policy one, we use Q-function two, and for updating policy two, we use Q-function one. So we decorrelate the error in the Q-function from the error in the policy. In theory, that would work and give you no overestimation bias, or less overestimation bias. But in practice, these two Q-functions are learned from the same data set, so the results will be highly correlated, and it doesn't really help. The easy fix is, instead of having an overestimation bias, to introduce an underestimation bias by using the minimum operator here. For computing the target values for your critic, you use the minimum over both Q-functions, whatever they tell you. Then you underestimate the Q-values. But underestimation is not that bad, because it doesn't propagate over the state space: the policy won't use actions that are underestimated — the optimization will only push the policy toward actions that are overestimated. Using this simple trick — and by now more or less all the off-policy actor-critic methods use it — is very easy to implement, and you avoid the overestimation. The only issue is that you need two different Q-functions, so computation is a bit more expensive.
So here, that's basically the algorithm, the TD3 algorithm. It's a very simple variant of approximate dynamic programming where you just have this extra actor update using the chain rule. You also use target networks, for the Q-function as well as for the policy. And here they don't use the standard rule where after some time you copy over the target networks, but Polyak averaging, which is quite equivalent, with a tau here that is pretty close to 1 — you slowly move the target networks to follow the real networks. OK. So now we know how it works for deterministic policies. But deterministic policies have the problem that you need to explore as well, right? You need to add noise, and if you don't optimize for the noise, it's usually very hard to set the noise levels — you get stuck in local optima and so on. So you want to use something with stochastic policies in the end. And yeah, OK, just a slide saying that TD3 is better than DDPG.
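Before moving on, a sketch of the clipped double-Q target and the Polyak update just described; note that the full TD3 also adds clipped noise to the target action, which I omit here:

```python
def td3_target(q1_t, q2_t, policy_t, rewards, next_states, dones, gamma=0.99):
    # Take the minimum over two critics: this trades the harmful
    # overestimation bias for a more benign underestimation bias.
    with torch.no_grad():
        a_next = policy_t(next_states)
        q_min = torch.min(q1_t(next_states, a_next), q2_t(next_states, a_next))
        return rewards + gamma * (1.0 - dones) * q_min

def polyak_update(net, target_net, tau=0.005):
    # Slowly move the target network toward the online network. (The lecture's
    # convention has the smoothing factor close to 1; here tau is the small
    # step toward the online weights instead — same update, other convention.)
    with torch.no_grad():
        for p, p_t in zip(net.parameters(), target_net.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)
```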
And this is the soft actor-critic algorithm. The solution is basically to keep the diversity, or entropy, of our policy high. And that's exactly the maximum entropy reinforcement learning problem that we have already seen before. So we now define a new objective where I say: I have the Q-function, and I want my policy to optimize the Q-function, but at the same time I want my policy to also optimize the entropy. The entropy is a natural information-theoretic measure for the uncertainty — or maybe also the diversity — of the action choice, of the behavior. You can see the equation here. For the distributions we usually use in continuous spaces, Gaussian distributions, you have a closed-form solution — that's what you can see here as well. This closed-form solution only depends on the variance of the Gaussian distribution, on the covariance matrix. In most cases, for the Gaussian distributions we use, the variance doesn't depend on the state, so the entropy of the policy doesn't depend on the state — it's constant over the states. And the entropy of a Gaussian distribution can also be negative. In discrete spaces that cannot happen, but in continuous spaces it's a differential entropy, so it can happen.
How the soft actor-critic algorithm formalizes the actor update, you can see here. It's almost what we have seen in the previous talk: we want to find the policy that maximizes the expected Q-values, and at the same time we want to have entropy in there. And oh, this should be a plus here — plus entropy, sorry for that. Which leads to this objective. We could again use the standard likelihood ratio policy gradient for that, which is written here. But again, we have the same issues: it has high variance, needs a lot of samples, and doesn't make use of the gradient information of Q. And the message here is: if you have gradient information for the objective you want to optimize, you should use it. In this case, that's the reparameterization trick. It was also mentioned in Matthew's talk, but he didn't really go into it. So, who here knows what the reparameterization trick is? Only a few people — good, then I can go into it a bit more, because it's a very, very important trick. It's not just used in reinforcement learning; it's used all over the place in machine learning.
The reparameterization trick solves the following problem. We want to find a distribution p over x, this one, that maximizes this expectation — it's very general here — and the distribution is parameterized by some parameters theta. If we don't have gradient information for f, the only way to get this gradient would be the likelihood ratio that we have already learned. If we do have gradient information for f, we should use reparameterization. So how does it work? We can reparameterize the expectation in the following way. We introduce a new random variable xi, distributed according to q of xi, and q has no parameters — it's parameter-free, for example just a Gaussian with zero mean and unit variance. Then we also define a mapping h that has the parameters theta, and this mapping transforms xi into x. If you have such a mapping, such that the x's are distributed according to p theta of x after the transformation, then obviously both expectations are the same: the expectation with respect to p theta of f of x is the same as the expectation with respect to q of xi of f of h of xi. And the trick is that the parameters have moved: before, they sat in the distribution we sample from; now they sit in the function that is applied before calling f.
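A toy sketch of the trick for a one-dimensional Gaussian (the general Gaussian case is spelled out next); the objective f here is made up purely for illustration:

```python
import torch

mu = torch.tensor(0.5, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)
f = lambda x: -(x - 2.0) ** 2            # any differentiable objective

xi = torch.randn(1000)                   # xi ~ q = N(0, 1), parameter-free
x = mu + log_sigma.exp() * xi            # x = h_theta(xi), so x ~ N(mu, sigma^2)
loss = -f(x).mean()                      # Monte Carlo estimate of -E[f(x)]
loss.backward()                          # gradients flow through h_theta
print(mu.grad, log_sigma.grad)
```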
So let me give you one example, and it's also the standard example that we will use: Gaussian distributions. Let's say p of x is a Gaussian distribution with mean and covariance, so our parameters are the mean and the covariance matrix. Then I can define the parameter-free distribution q as a zero-mean, unit-variance distribution, and the mapping h theta of xi is the mean plus A transposed times xi, where the matrix A is the Cholesky decomposition of the covariance matrix. It's very easy to show that the p of x I get when I transform xi into x is the same as the original distribution. And that's the standard way we use the reparameterization trick, because now we can compute the gradients not based on this equation here, but based on this one. As we know, q doesn't depend on theta, so we don't care about it, but h depends on theta, and here we can again apply the chain rule to get the gradient. That's the way you should compute a policy gradient if you know the gradients of the objective — much more efficient than the standard policy gradient.
So for the actor update, back to the SAC algorithm, it looks as follows. Our policy is a neural network; usually it gives us a mean that depends on the state, and it could also be a covariance matrix that depends on the state — in this case, the network gives us the Cholesky factor of the covariance matrix, so if I compute A transposed times A, I get the covariance matrix. This could be state-dependent; in most cases it's not, but to stay general, let's say it is. Then our mapping function h depends on the state s and on xi, as defined before. And then we can just plug that into the gradient for the SAC algorithm — we have it here for the Q-function, but also here for the entropy term. And we get an unbiased gradient estimator — unbiased with respect to the Q-function; the Q-function itself will be biased, but given this Q-function, it's an unbiased gradient estimator. This algorithm extends the DDPG style of policy gradient computation to stochastic policies, so we also get a gradient for our noise, and that's exactly what SAC is doing.
So here, again, we have the algorithm box for SAC. You can see it's more or less a standard approximate dynamic programming algorithm, where we now also added these terms in the target values to put the maximum entropy objective into the value function as well. For SAC, it's actually debatable whether you need that, because, as we know, this term is a constant over the state space: the entropy doesn't depend on the state if we use a Gaussian policy whose covariance doesn't depend on the state.
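The actor update described above, as a hedged sketch (omitting details like the tanh squashing the real SAC uses; a policy network returning mean and log-std, and a critic returning one value per batch element, are my assumptions):

```python
import torch
from torch.distributions import Normal

def sac_actor_loss(policy_net, q_net, states, alpha):
    mu, log_std = policy_net(states)
    dist = Normal(mu, log_std.exp())
    actions = dist.rsample()                    # reparameterized, differentiable sample
    log_prob = dist.log_prob(actions).sum(-1)   # log pi(a|s), shape (batch,)
    # Maximize E[Q(s,a) + alpha * entropy], i.e. minimize the negative;
    # -log pi(a|s) is the single-sample entropy estimate.
    return (alpha * log_prob - q_net(states, actions)).mean()
```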
OK. So the SAC algorithm is, I would say, right now the state of the art for off-policy actor-critic algorithms. They compared it to many different other algorithms. You can see here that the orange one is SAC; DDPG is the green one — I think if you set the hyperparameters better, DDPG also works a bit better than here, but usually SAC is very good. It also outperforms the on-policy gradient methods here. The reason is that you would have to give on-policy methods like PPO much, much more time to achieve the same result — usually they do get better, but you see SAC is quite a bit faster than the on-policy gradient methods. These are the evaluations for different standard MuJoCo environments. And here you see some evaluations on a real robot application, if the video is playing. OK, maybe let's skip that and look at the next video. It's another real-robot example, where they used SAC on the task of turning this valve. The blue part of the valve always needs to point to the right. And that's what it learned, just from image-based input — the input to the policy you can see here. You can learn that directly on a real robot; it doesn't take too much real-robot interaction time, something like one hour, a bit more. It takes much more computation time — you wait quite a bit between the iterations until you get a result.
OK, so this maximum entropy formulation that we now used in SAC also has some other very nice properties, and I quickly want to show that here; it's presented in this paper, also from the Levine group. It basically shows that the maximum entropy formulation gives policies that are robust — robust, for example, to unseen perturbations. For example, what the robot here was tasked to do is take this object and move it to the target area. If you use standard reinforcement learning, you see, OK, you can learn that, no problem. But the final policy more or less always does it the same way, because there is no stochasticity left in the final policy — it's deterministic. The SAC policy on the right side, maximum entropy reinforcement learning — you see the different evaluations are different, because you have noise in the final policy; there is stochasticity in it. Now you might think, OK, that's stupid, that's not optimal. But actually it's much more robust, isn't it? Because in this policy, you also have information about how to recover from errors: because you deliberately do something wrong, you also learn how to recover from it. On the left side, you have forgotten that already. The optimal policy doesn't know it anymore, because the suboptimal states are not in your replay buffer anymore. So if I now introduce, after training, a perturbation that the algorithm hasn't seen, like this obstacle here, the standard reinforcement learning algorithm struggles — it cannot resolve it — while the MaxEnt algorithm finds a solution, because it has learned this recovery strategy directly. And you get similar properties if you have a model mismatch between the model you used for learning and the model you use for execution. Here, they changed the mass of the robot after learning — you can see that on the x-axis, the relative mass has been changed. And these are the different MaxEnt reinforcement learning strategies, SAC with different alpha parameters.
And you see that, in comparison to the other robust reinforcement learning algorithms they compared against, it doesn't really matter what those do — MaxEnt actually works much better, and it's also more robust to these changes. And if I set the alpha higher — here we have alpha 0.1, and here alpha 0.001 — then it's more robust to the changes. That's a bit difficult to see, except here, where it didn't really work. But the best performance at a relative mass of 1, when you don't change the model, is a little bit worse, because you force the algorithm to learn a more stochastic policy. For sim-to-real transfer, for example, that's a quite nice property, even though nobody has really used it so far. Yeah, so my wrap-up for SAC: we have now seen an off-policy maximum entropy deep reinforcement learning algorithm that is very data-efficient, can scale to high-dimensional observations — you can even learn from images directly — and is quite robust to different random seeds, noise, and so on. That's why the community has more or less opted for this algorithm. There's a second off-policy algorithm that also works very well, called maximum a posteriori policy optimization, and it's not really clear which one is better, but I think SAC is used a bit more because more implementations exist. OK, so how much time do we have? Still 45 minutes? Eight more minutes. But are there any questions so far about SAC?
OK, then I want to continue with the second part: robust policy optimization. Now we go back to the on-policy gradient domain — to algorithms for the case where you apply them in a simulator, where we don't really care about sample efficiency. The algorithm needs to be fast, and it should give us an unbiased gradient estimate. What these algorithms typically do is: they don't learn the Q-function, and they don't depend on the Q-function; they depend directly on the returns, on the Monte Carlo returns. They might learn a value function for the baseline, but they don't learn the Q-function. So they are easy to use, computationally fast, and yield high-quality policies. But what we have seen is that in many cases the optimization can still be tricky or unstable for complex policies. I think in the policy gradient lecture you covered TRPO and PPO already, right? Both try to solve this instability problem in some way, but not in a really satisfying way, I would say, because they use a lot of approximations, such that the nice stability properties of a trust region that we'll see are gone — you don't have them anymore. We'll go into that a bit more.
So maybe as a recap: why do we actually need a robust gradient update? I believe it's highly connected to the step size you get from the policy gradient when we talk about Gaussian policies. It's very specific to the Gaussian policy, because what we typically do is start with an initial policy that has quite high variance, to cover the whole space, and then we want to learn to reduce this variance to converge to an optimal policy. If you look at the gradient magnitudes during this learning process, they vary a lot, because the gradient of the Gaussian policy depends on the variance: if you have a small variance, the gradient will be huge; if you have a high variance, the gradient will be small. And that's why these algorithms get unstable quite easily.
And the second problem — and that's what we also see in many of the failure cases of these algorithms — is that the exploration, the variance, decreases very quickly. The algorithm finds out: if I decrease exploration, my immediate performance will be better. But that's not what you want. You want to explore such that your performance after a certain number of iterations will be better. If you decrease exploration too fast, you just get stuck in a local optimum, and you don't want that. So typically you get these two cases: if you have a too-small learning rate, you update very moderately, and it takes very long until you find a good solution; if you have a too-high learning rate, you update very greedily and jump directly to the best solution you've found, without any variance left. Both cases are not good. And it's very hard to control, as I said, because the magnitude of your gradient can vary so much.
As I said, it's highly connected to the structure of the Gaussian policy, or of the gradient of the Gaussian policy. If we have a Gaussian policy where, just for simplicity, only the mean depends on the state — this is a neural network here — and we compute the gradient of the log density, which is the one we need for the likelihood ratio policy gradient, then you see what this gradient does: you compute the difference between the action and the mean, and you scale this difference by the inverse of the variance. So if the variance is small, this gradient explodes — it has a very, very high magnitude in that case. That's why it's very hard to choose the step sizes, and standard policy gradient without any step size control, like the REINFORCE algorithm, just doesn't work — it just doesn't do anything for more complex problems.
So what people came up with is to formalize trust region problems. And that's very similar to what we heard in the previous talk — just that in the previous talk we had the KL between the old and new policy as an additional penalty term, and here we have it as a trust region. So what does that mean? We want to find new policy parameters that optimize the advantage — here we have the importance-weighted objective, but that doesn't really matter — under a constraint, or actually under many constraints: we want the KL between the new policy and the old policy to be smaller than some bound epsilon, and in theory we want to satisfy this constraint for every state. That's quite hard to satisfy, but it would be the optimal thing, and if you satisfy it, you can even get monotonic improvement guarantees for these algorithms. The benefit is that it stabilizes the learning process, because it avoids your policy jumping around too much, and it also avoids a too-fast decline of the variance, because then the KL between the old and new policy would be too high. Nearly all the successful policy gradient algorithms do this in one way or another. TRPO is a natural gradient algorithm, and the natural gradient can actually be derived from this KL point of view — it's a second-order Taylor approximation of it. And PPO uses similar ideas, but with a lot of heuristics and tricks to get it to work.
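Written out, the trust region problem just described looks roughly like this (I'm using the old-versus-new direction of the KL that TRPO uses; the exact direction on the slides may differ):

```latex
\max_{\theta}\;
\mathbb{E}_{s,a \sim \pi_{\text{old}}}\!\left[
  \frac{\pi_{\theta}(a \mid s)}{\pi_{\text{old}}(a \mid s)}\,
  A^{\pi_{\text{old}}}(s,a)
\right]
\quad\text{s.t.}\quad
\mathrm{KL}\!\big(\pi_{\text{old}}(\cdot \mid s)\,\big\|\,\pi_{\theta}(\cdot \mid s)\big)
\le \epsilon
\;\;\text{for all } s.
```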
The most commonly used measure here, as we have already heard, is the Kullback-Leibler divergence. Some of its properties: it's always positive, and it's only zero if both distributions are the same. It's non-symmetric. There are many different names for the KL — relative entropy, information gain, and so on — and it's used in many different domains: variational inference, information theory. It actually comes from information theory. But now we are talking about the Kullback-Leibler divergence between two Gaussian distributions, and here we have a closed-form solution for it. It's actually quite illustrative to look at how it looks, because it contains three different terms. The KL for Gaussian distributions is also always greater than or equal to zero, and it can only be zero if both are the same. It has three different terms. The third term, the blue term, compares the means of both distributions, scaled by the covariance matrix of the right-hand distribution. Then you have the second term, which you can think of as comparing the entropies of both distributions — if the entropies are the same, the second term vanishes. And the first term I would call the rotation of the covariance matrix: it's the trace of the inverse of the one covariance matrix times the other. So it's more a geometric interpretation of the KL when you talk about Gaussian distributions: you compare the mean and the covariance in a very specific way.
And one way to solve the trust region problem I gave you before is constrained optimization. Here it's an even simpler problem, now without states, just actions: you want to find a distribution over actions, where each action has a certain reward, under a KL constraint — q is, in this case, the old distribution — and another constraint, obviously, that pi needs to be a distribution, needs to sum to one. Constrained optimization is a field of its own; you can solve this using Lagrangian multipliers. I don't have time to go into that, but I have time to show you the solution. It's the same solution we have already seen — that's the funny thing; it doesn't really matter whether this is a constraint or an additional penalty term in your cost function. The new policy is proportional to the old policy times the exponential of the reward divided by a Lagrangian multiplier. The only difference to the penalty-term version that Matthew introduced is this Lagrangian multiplier: it can be optimized for. Given the bound epsilon, you can optimize for eta and compute it. And again you can see: if you have a small eta, which directly translates to a large epsilon, you get an almost greedy algorithm, because the reward will dominate and the old distribution doesn't matter. If you have a large eta, which comes from a very small epsilon in the bound we had before, the old distribution dominates, and you won't move very far in distribution space. So that would basically be the optimal way to solve this trust region problem. But we can actually only solve it like this for the discrete-action, no-state case — we could also do it for discrete states. What most algorithms do instead is find approximations of this. There are natural gradients, used in the trust region policy optimization algorithm, and there is also compatible function approximation that can solve it — I won't go into that.
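For reference, the closed-form solution just described, with eta the Lagrangian multiplier of the KL constraint:

```latex
\pi(a) \;\propto\; q(a)\,\exp\!\big(r(a)/\eta\big)
```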
Then there is the proximal policy optimization algorithm, PPO, which was also already introduced in one of the lectures. I want to go back to it quickly and show you: it's a good algorithm, but it has a lot of problems as well. And then, maybe after the break, I'll introduce the differentiable trust region layers. The proximal policy optimization algorithm, as I said, has been discussed already last week; for on-policy gradients, that's the algorithm people use right now. What it does is this: you have the standard policy gradient objective here — the advantage function, importance-weighted, where you directly express the importance weight of every sample. And then it forms a lower bound of the objective by taking the minimum of the standard objective and a clipped objective, where the importance weights cannot be bigger than 1 plus epsilon or smaller than 1 minus epsilon. This clipping removes the incentive for the policy to move too far away from the old policy — that's the intuition. But the question is: does it really do that? Does it really work? Somehow it does, but you don't have any guarantee that these trust regions are really satisfied — we'll see that — and it can destroy your optimization completely. The good thing is: if you take this objective, the PPO objective, you can plug in any standard deep neural network optimizer, like Adam, and it does all the magic for you. The clipping is some sort of bias-variance trade-off: the importance weights can get very, very high, and by clipping them you avoid that. But I don't think it has really been analyzed in that way — you don't directly know what kind of bias you introduce here. You definitely get the variance down; I don't know what happens to the bias, I haven't seen an analysis of that. A similar idea is used in off-policy corrections for reusing past experience: in order to reduce the variance, they accept a higher bias by using this clipping trick, so it's kind of the same. For the Retrace algorithm, for example, the importance weights are clipped at one, I think — they don't clip from below. It's not analyzed theoretically what the clipping really does here, but it limits the variance; it definitely does that.
OK, so you can do a lot of cool stuff with it. I'm not sure whether you've seen videos like this already — this is now from OpenAI. They learned mostly in simulation — they had a very good simulator, obviously — and then they could do the Rubik's Cube and similar things with this dexterous Shadow Hand, directly in the real world. That is very impressive. And as I said, it's currently the state-of-the-art algorithm for on-policy reinforcement learning. It's easy to implement, and basically the idea is: throw tons of GPUs at the problem, and it somehow works. But as a reinforcement learning guy, it was always quite unsatisfying to use, because the theory is not very nice. Actually, there are some papers analyzing PPO more closely — why it works so well — and it's not because of the objective; it's because of all the additional tricks and hacks they introduced in the paper. If you implement it slightly differently, the algorithm behaves quite differently. That's basically what this analysis paper shows: the performance boost mainly comes from the hacks.
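The clipped surrogate itself is only a few lines; a minimal sketch:

```python
def ppo_clip_loss(log_prob_new, log_prob_old, advantages, eps=0.2):
    # Importance weight of each sample under the new policy.
    ratio = torch.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic (clipped) lower bound; minimize the negative to ascend it.
    return -torch.min(unclipped, clipped).mean()
```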
It's still much better than TRPO, the previous algorithm, because it's much faster computationally — it doesn't need the natural gradient formulation. And we will see one algorithm that fixes most of these issues. But I think now maybe let's have a three, four minute break, and then we can continue. OK. Basically, we stopped with me bashing a little bit on PPO. It's still a good algorithm — we use it as well — but how the algorithm is derived is, as I said, not very satisfying; it's hacky theory, and the performance depends on many of these heuristics. What you can do instead — and that's a rather new algorithm from our group — is to introduce trust regions directly inside the neural network. That's what we did; it's called differentiable trust region layers. The key idea is the following. There has been — it's now already three, four years ago — a new approach for building convex optimization layers into a neural network. It's from Zico Kolter's group at CMU. The very general idea is: if you have a convex optimization program like this one — you want to minimize something under some inequality and equality constraints — then you can use convex optimization, with Lagrangian multipliers, and make this optimization program differentiable. The function and the constraints can now have parameters coming from a neural network, and you can differentiate through these parameters. How exactly it works — I don't have time to go into that; you basically differentiate the KKT conditions. But I want to show you how you can use this for reinforcement learning. Because for the trust regions we introduced, we have exactly this kind of problem: it's a convex problem, because the KL is convex, and the maximization is just an expectation, which is also fine.
So if you look at the original objective, we have this one: a linear objective, an expectation, and a convex constraint, the KL, for each state. Most other methods don't solve it for each state, only for the expected KL; here we can really solve it for each state. The idea is now: if we can construct a policy — let's call it pi tilde — that always satisfies this constraint, so it's guaranteed that the constraint is satisfied, then we can just drop the constraint from our optimization program and optimize the standard policy gradient objective. And the idea is to build this constraint directly into the policy. Actually, we went one step further: for the Kullback-Leibler divergence, you can decompose it into one constraint on how much the mean may deviate and one constraint on how much the covariance may deviate. It turned out that for the mean you can allow much larger step sizes than for the covariance, because you don't want the variance to shrink too fast. So you have two constraints, one for the mean and one for the covariance, for each state — for each state meaning for each state sample that you have. We did that to achieve faster convergence. And that's how it looks: you have a neural network here, your standard policy, which gives you a mean and a covariance. Then you have your old policy, with the old mean and covariance. And then you have this trust region layer — a differentiable layer, just like any neural network layer — which gives you a new mean and a new covariance, mu tilde and sigma tilde.
And you use these to define the loss, and then you can just backpropagate through the objective to get a new policy. The advantage is, first of all, that the trust region is satisfied for every state — and it's satisfied exactly, not approximately as in most other methods. If you look at what kind of projections you can use here: as I said, we use a different distance measure for the mean and for the covariance. The optimization problem for the mean looks as follows: you want to find a mean that is most similar to the output of the neural network — this mu tilde is the output of the trust region layer — under the constraint that mu tilde is also close to the old mean, within some epsilon mu. You can plug in the mean part of the Kullback-Leibler divergence for both of these distance measures to get this optimization program, and, using Lagrangian optimization, you can solve it in closed form. You get this solution: the trust region mean is a linear interpolation between the output of the policy and the old mean, where the omega parameter specifying the interpolation can also be computed in closed form — the exact equation doesn't really matter here. You can do the same for the covariance matrix: you take the covariance part of the KL as the distance measure, and you can again derive a closed-form solution, which tells you that the trust region precision matrix — the inverse of the covariance — is again a linear interpolation between the precision of your neural network policy and that of the old policy. Again, eta is a Lagrangian multiplier. There is no closed-form solution for eta, but you can obtain it by convex optimization, and it's still fully differentiable thanks to this Zico Kolter paper. OK. So now we have constructed a policy that always satisfies our trust region constraints, and we can differentiate through it. That's a very elegant way to get rid of all these approximation issues. A small sketch of the mean projection follows below.
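This is my reading of the mean projection, reconstructed from the Mahalanobis-distance construction described above — treat the exact form of omega as an assumption rather than a verified implementation:

```python
import torch

def project_mean(mu, mu_old, prec_old, eps_mu):
    # Mean part of the KL: d = (mu - mu_old)^T Sigma_old^{-1} (mu - mu_old),
    # computed per state in the batch.
    diff = mu - mu_old
    d = torch.einsum('bi,bij,bj->b', diff, prec_old, diff)
    # omega = 0 inside the trust region (projection is the identity);
    # it grows the further the network output violates the bound.
    omega = torch.clamp(torch.sqrt(d / eps_mu) - 1.0, min=0.0).unsqueeze(-1)
    # Linear interpolation between network output and old mean; the result
    # satisfies the mean constraint exactly and stays differentiable.
    return (mu + omega * mu_old) / (1.0 + omega)
```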
And we tested it on the standard benchmarks from MuJoCo — the hopper and the walker and so on. The trust region layers are here with different distance measures: the KL is one, the Wasserstein distance, the Frobenius norm. And the result was: it works OK, but sometimes PPO is still better — with the big difference that the trust region layers work without any of PPO's additional hacks. They just work; you don't need all that. But still, it was like, OK, it works more or less the same. One good property, though — and that's what you see in this plot — is the trust region violation of these different approaches. You have PPO here in magenta; you see it goes completely crazy — that's actually a version of PPO that doesn't use learning rate control, the learning rate is constant. The other version of PPO, the standard one, the brown one, anneals the learning rate down to zero — that's why you see nothing happening anymore: you just don't learn anymore, which is not a very good solution either. You can see that these trust region constraints are not satisfied, while for the convex-optimization-layer algorithms you have a very clearly defined trust region. And this is the maximum KL over all states between the old and new policy.
OK, so performance-wise, on the standard benchmark tasks, it was not so much better, and we were a little bit depressed about that. But it turns out those tasks were just too simple to see a difference — here, PPO works very well. If you take more complex tasks, PPO really starts to degenerate. One benchmark where we evaluated it is the Meta-World tasks: 50 different manipulation tasks, like button press, door open, and things like that. And here you can already see that TRPL, the trust region projection layers, is consistently better than PPO. PPO is the orange one, and TRPL is the brown one here. You can also see where SAC is — it's hard to see, but it basically goes up until here, more or less the same performance as PPO, and it cannot outperform the TRPL method. We could not run SAC as long as the other methods because it's ten times slower computationally — all of them were run for the same computation time.
OK, so we now have an algorithm that gives us a much more robust policy gradient estimate because of the trust regions, and we wanted to use it for more abstract, more complex action representations. That's basically the last part of my talk, because in robotics it's very important what kind of policy you use, and here we can use a lot of knowledge from robotics about what a good controller, a good policy, is. A standard neural network policy with random noise at every time step: not so good. What I want to show you here is, first of all, what motion primitives are — movement primitives, very, very quickly — and how you can use them for reinforcement learning. So what is a motion primitive? There's a whole research area on that, but as a very quick introduction: a motion primitive is basically a trajectory generator. It gives you a trajectory, given some parameters. So y here is our trajectory — or the trajectory of velocities of the robot — given the parameters w and initial conditions. There are different motion primitive formulations; it doesn't really matter for us right now. The properties are: it has a small number of parameters for the trajectory, let's say 20 to 50 parameters, defining the future desired trajectory. Then you can just use standard trajectory tracking controllers, which we know very well in robotics, to follow that trajectory. So it has a small number of parameters. And it comes from imitation learning: you can very easily demonstrate a trajectory and get the parameters of the motion primitive such that you can reproduce it. And our idea was: OK, we want to use reinforcement learning on this representation, because it has some very good properties — the ones I want to focus on here. In principle, there are two variants of motion primitives that are mainly used in the robotics community. One is called dynamic movement primitives, which define a dynamical system using the parameters w; by integrating this dynamical system, you get the desired trajectory. The other variant is called probabilistic movement primitives — also coming from our group — where the trajectory is defined more simply: you just have basis functions in time, basically Gaussian basis functions, very similar to a Gaussian kernel; multiply them with w, and you get your trajectory.
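A small sketch of such a basis-function trajectory generator (the basis count and width are made-up values; real ProMPs also handle velocities, distributions over weights, and conditioning, which I skip here):

```python
import numpy as np

def promp_trajectory(w, T=200, n_basis=20, width=0.01):
    # Normalized Gaussian basis functions in time, weighted by w.
    t = np.linspace(0.0, 1.0, T)[:, None]          # (T, 1) time steps
    c = np.linspace(0.0, 1.0, n_basis)[None, :]    # (1, K) basis centers
    phi = np.exp(-(t - c) ** 2 / (2.0 * width))    # (T, K) basis activations
    phi /= phi.sum(axis=1, keepdims=True)          # normalize per time step
    return phi @ w                                 # (T,) desired trajectory

w = np.random.randn(20)        # ~20-50 weights parameterize the whole movement
y_des = promp_trajectory(w)    # hand this to a standard PD tracking controller
```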
And recently we had a publication on combining both, but that doesn't really matter for this talk. A probabilistic representation also defines a distribution over trajectories, not just a single trajectory — which is also nice for reinforcement learning, but not that relevant here. So we now want to use the motion primitives in a reinforcement learning scenario, and the scenario is actually quite simple. We call it black-box reinforcement learning, or contextual black-box reinforcement learning. How does the scenario look? You see a context — the context is a state telling you what the task is. For example, you want to move a box to a certain position; then the context is the position you want to move it to. Given the context, you select the motion primitive parameters: what is the trajectory you want to execute? Given the motion primitive parameters, I can evaluate the trajectory very easily, and given the trajectory, I can just plug it into a controller — a PID or PD controller in most cases — and the robot does the magic. The big difference to standard reinforcement learning — and that's why we call it black-box — is that this is like a bandit: you choose w only once, and then you execute. Done. It's only one action, but it's a high-dimensional action. There's also no way to recover: if you choose w wrong, you execute, and you're screwed. So you need to choose it well. The same action — the motion primitive parameters, in this case — is used for the entire trajectory. But it's still quite complex, because you have the context in there: for different contexts, for different tasks, you want to execute different trajectories. And again, you want to maximize the return — the expected return, which now depends on the parameters w and the context, and you also have a distribution over these context vectors. So in principle, this is a contextual bandit, but with an infinite number of arms, because w is high-dimensional and continuous.
OK, so why is this a good idea? First of all, we have a small number of parameters to optimize — these 20 to 40 trajectory parameters — and we can explore directly in this parameter space of the motion primitive. We don't explore in the action space anymore; we only explore at the beginning, which yields correlated exploration over time, not this random walk behavior. Then, nobody tells us how the rewards should look. In the standard reinforcement learning setup, the reward needs to be Markovian; it needs to depend on the state and the action. Here, no — the reward can depend on the whole trajectory; it doesn't matter. It's a very free formulation. And in addition — I think this is one of the main advantages — there's much less noise. In the step-based setup, even in a deterministic environment, the returns will be noisy as hell, because I explore at every time step. So it's very hard to say whether the action at the first time step was good or not, because afterwards so much noise happens that I have no clue. Here, in a deterministic environment, the returns will be deterministic as well, so it's much easier to find out which parameters are good and which are bad. But obviously, there are also some disadvantages. So first of all, the trajectory is just an open-loop trajectory.
So we don't react to unexpected changes in the state or to perturbations — there's no sensory feedback — but we will come to that. And it's sample-inefficient, because every trajectory is just one sample, while in the step-based formulation every trajectory is 200 samples or so, depending on the length of the trajectory. That's actually why people didn't really do this — it sounds like a stupid idea; it's very inefficient in that sense. On the other hand, compare it to step-based reinforcement learning: there you have a deep neural network policy that selects an action at every time step, with a huge number of parameters, and the exploration is done in action space. The good thing is it's very flexible, because you have closed-loop feedback through the neural network. But as we will see, it's very inefficient in its exploration — there are other methods for dealing with that, but we cannot go into them. Why is that the case? Because the Gaussian noise introduced by Gaussian policies is like a random walk, and a random walk is not very good for exploration on a real robot. We shouldn't do it anyway, because it results in very unsmooth behaviors.
And I've also told you that in the black-box setup we can use much more general reward definitions — return definitions — which we call non-Markovian returns. Why does this make sense? Because often it's easier, or more direct, to define the desired behavior by looking at the whole trajectory and judging the whole trajectory, not a single state. One simple example — I guess some of the videos are broken here — would be jumping. If you want to jump as high as possible, it's very easy to give a reward on the trajectory: the maximum height. For a step-based algorithm, it's very hard. You could give the height at every time step, but that's not what you want, because then you would jump a little bit all the time and not achieve the maximum height. The same for a table tennis environment: if you want to hit the ball, one natural way of defining it is the minimum distance between the racket and the ball, the minimum over the trajectory. That is, again, very hard to define as a Markovian reward, and much easier to define like this. So in many setups, defining a Markovian reward is not that intuitive, and it might bias the behavior away from what you actually want.
So let's accept for now that black-box reinforcement learning might be interesting, even though it's quite inefficient. As I said, the same action is used for the entire trajectory, and the consequence is that we really need to learn a highly accurate policy, because we have only one shot: if we choose w slightly wrong, the performance will be very bad. So we need a reinforcement learning algorithm that gives us highly accurate policies. The first thing we tried was to use PPO in this setup — only one time step, high-dimensional actions — and that doesn't work. But the differentiable trust region layers, because of these nice properties that I tried to show you, do work. So here you have some quite simple case studies. We started our evaluation with two different environments. One is a reach environment, where you have a five-link robot that needs to reach different desired positions — the context is the desired position. And the other one is a box-pushing environment, which you can see below, and which we used in this work,
So let's accept that black-box reinforcement learning might be interesting, even though it's quite sample inefficient. As I said, the same action is used for the entire trajectory, and the consequence is that we really need to learn a highly accurate policy, because we only have one shot; if we choose W slightly wrong, the performance will be very bad. So we need a reinforcement learning algorithm that gives us highly accurate policies. The first thing we tried was PPO in this setup, so with only one time step and high-dimensional actions, and that doesn't work. But the differentiable trust region layers, because of these nice properties I tried to show you, do work.

So here are some quite simple case studies. We started our evaluation with two different environments. One is a reach environment, where a five-link robot needs to reach different desired positions; the context is the desired position. The other one is a box-pushing environment, which you can see below, where you need to push a box to a desired location and desired orientation. That's actually not that simple a task. We also played around with different types of rewards, dense rewards and sparse rewards, where we mean sparse in time: the dense reward gives the distance to the desired location at every time step, the sparse reward gives it only at the end, at the last time step. That might not seem like a big difference, but it actually is. Step-based reinforcement learning algorithms like PPO are really made for the dense setup; the reward basically needs to tell you at almost every time step what to do. For sparse setups, they don't work very well. Black-box reinforcement learning only ever sees the return anyway, so it doesn't matter at which time step the reward arrives, and it works well in the sparse case too.

Here you can see the different behaviors, this is for dense rewards. For the box-pushing task you also see how it sometimes has to change the rotation of the box, so that the blue side corresponds to the blue side of the target box. And that works pretty well. If you do this with motion primitives and use PPO as the optimizer, it does work, but it's not very accurate. If you use the trust region layers instead, it's much better, because the policy gradient with the trust regions is much more accurate; there are no approximations.

Yes. So the question was: do you care about the whole trajectory or a function of the trajectory? It depends on the task, obviously. Here it's the dense reward, so it's the whole trajectory. In the sparse reward setup, the distance to the goal only counts at the last time step, but you still have additional rewards at every time step, like the action punishment, since you want small actions all the time. So it really depends on the structure of the problem. For the non-Markovian case, you would use a minimum or a maximum over the trajectory, or something like that, so you don't know in advance which time step it is.

These are the same tasks with sparse rewards. Here you can already see that PPO finds something funny: it tries to throw the object to the desired location, which sometimes works, but not most of the time. The black-box reinforcement learning setup with motion primitives, you can see, does quite a nice job here.

Now you could ask: why do we want sparse rewards at all if dense rewards already work with step-based reinforcement learning? It turns out it depends on what you want. For example, if you want additional objectives, like the motion being energy efficient as well, that's very hard to get with a step-based dense reward, because the dense reward forces the agent to move to the target very fast; at every time step it wants to reduce the distance. The sparse reward only cares about the end, and it doesn't matter how you get there. Here you see the energy efficiency for different control costs, and you see that, depending on the goal distance, the black-box formulation reaches a much better energy efficiency for the same goal distances than you could reach with a step-based dense reward. A small sketch contrasting the two return definitions follows below.
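Here is a small sketch of the two return types just discussed, with hypothetical shapes and scales of my own choosing: the dense version penalizes the goal distance at every step, the sparse-in-time version only at the last step, while the per-step action punishment stays in both.

```python
import numpy as np

def dense_return(box_poses, goal, actions, action_cost=1e-4):
    """Goal-distance penalty at every time step plus a per-step action penalty."""
    dists = np.linalg.norm(box_poses - goal, axis=1)
    return -dists.sum() - action_cost * np.square(actions).sum()

def sparse_return(box_poses, goal, actions, action_cost=1e-4):
    """Goal-distance penalty only at the final step; action penalty per step."""
    return (-np.linalg.norm(box_poses[-1] - goal)
            - action_cost * np.square(actions).sum())
```

The dense version rewards closing the distance as fast as possible at every step, which is exactly what fights the energy-efficiency objective; the sparse version leaves the path there completely free.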
Another property, and that's what I said in the beginning, is smoothness. We evaluated different step-based and motion-primitive-based policies at the beginning of learning and at the end of learning. At the beginning of learning, this is just a random policy. Here you see the position trajectory of one of the robot's joints, the velocity trajectory, and the action trajectory; the actions are obviously very noisy. For the motion primitive approach, everything is nice and smooth, because you just generate a desired trajectory and then follow it. Even the trained PPO policy is much smoother than at the start, because you reduce the variance and learn to generate something smooth, since that's part of the objective. But it will never be really smooth, because the state is different every time and there is no constraint forcing the policy to be smooth; you get this kind of action profile. So it will always be less energy efficient than the trajectory-based version.

We also applied this to the Meta-World tasks, 50 different tasks. Here you can again see the quality of the learned policies. In these tasks you don't actually need any sensory feedback; you can just define from the initial state the trajectory you want to execute, as there are no perturbations. The motion-primitive-based variant, in this case the blue one, even outperforms the trust region layers in the step-based variant, because of the better exploration properties and the inherent smoothness built into the policy.

We also looked at other examples like the hopper, which was part of the motivation for the non-Markovian rewards. Looking at the different algorithms, again the blue one is the trust region layers, the green one is PPO with motion primitives, and the orange one is just PPO. It's always interesting to compare the behaviors. PPO basically jumps all the time, but not really high; you can actually jump much higher. From the motion-primitive-based black-box formulation you get a jump like this, which is quite a bit higher than for the step-based versions, because you can directly say in the reward that the maximum height is what counts.

For some more results, you can see table tennis here; maybe I'll just show you the videos. Table tennis is very hard to learn, and it's very hard to define a good reward. You can see what PPO does: a little bit of a crazy dance. And you can see what the black-box formulation does. Here the context is defined by the initial ball trajectory, but also by the target where you want to return the ball. It's not fully accurate, but it's obviously much better than this jittery thing.

But one thing that is not really satisfying is this completely black-box formulation, where you generate only one trajectory and that's it, so you cannot react to unforeseen events. The question is: can we do replanning here as well? It turns out this is pretty easy, because the trust region layers are just a policy gradient formulation, so you can very easily add replanning time steps where you re-select the motion primitive parameters. With some of these motion primitive formulations, like the ProDMPs, that works very well, because the trajectory stays smooth across replanning steps; a sketch of this replanning loop follows below.
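Here is a minimal sketch of that replanning loop, with placeholder names and a random stand-in policy rather than the actual interface from the paper: the outer loop re-selects primitive weights a handful of times per episode, while the inner loop tracks the resulting set-points at every control step.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_weights(state, n_basis=20):
    """Stand-in for the learned policy pi(w | s); random weights here."""
    return rng.normal(scale=0.1, size=n_basis)

def desired_segment(w, n_steps=40):
    """Turn weights into desired positions via Gaussian basis functions."""
    t = np.linspace(0.0, 1.0, n_steps)
    centers = np.linspace(0.0, 1.0, len(w))
    phi = np.exp(-50.0 * (t[:, None] - centers[None, :]) ** 2)
    return (phi / phi.sum(axis=1, keepdims=True)) @ w

def episode(n_segments=5):
    """Two feedback loops: slow replanning outside, fast tracking inside."""
    position = 0.0
    for _ in range(n_segments):
        w = select_weights(position)               # outer loop: replan weights
        for target in desired_segment(w):
            position += 0.5 * (target - position)  # inner loop: PD-like tracking
    return position

episode()
```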
That's a very recent publication, or submission, that we had to JMLR. We call it MP3, movement-primitive-based replanning policies, where you basically have two feedback loops: an inner feedback loop of the controller, which is predefined, and another feedback loop on a much larger time scale, maybe only five times per episode, where you re-select the motion primitive parameters of the trajectory.

Then you can do things like box-pushing where the target, for example, changes randomly during the episode. You can see that for dense rewards, PPO has a lot of problems here; it doesn't really do much. I mean, it does something, but you see it's quite desperate. Why is this the case? We believe that this random change of the target position at some time step adds a lot more variance to the returns, and PPO breaks, because it already has so much variance in the returns from the step-based exploration; with this additional variance, it doesn't work very well. Replanning works much better here, if the video would play. OK, somehow the player doesn't like me today. But it works much better; you can see that in the rewards.

Similar things hold for table tennis, where you can see that learning with replanning is also more sample efficient, because instead of one sample per trajectory you now have maybe five, depending on the number of replanning time steps. And you see again, comparing standard black-box reinforcement learning with replanning, that replanning actually produces better policies that are a little bit smoother, because you have more training data. You can also introduce perturbations, like unobserved wind velocities, and then replan; that also works with this replanning approach, as you can see here.

Yes, and I think I'm running out of time. Could it be? Yes, OK. Then I will quickly finish. I have a few more slides, but I will not cover them; you have them on the web page, so that's fine. But just to wrap up what we have seen today. I started with this off-policy versus on-policy discussion, which is basically a trade-off between low bias and sample complexity. On-policy works very well and has low bias, so you get better policies in the end, but it takes a lot of simulation time; and again, it depends whether you want to optimize in simulation or on the real robot. We have seen that the maximum entropy objective induces better exploration, robustness to model mismatches, and diversity, although I wasn't able to show that anymore. We have seen that trust regions can also be implemented in a more principled way using convex optimization, so they don't require hacks and give you better policies. And that allows us to also learn with more abstract action representations in a principled way. That was my talk, and now I think it's time for lunch.