OK, welcome back, everyone. For this second lecture we had planned to have Andreas Krause, but he got sick, so he's not here today. But we are really, really grateful to Felix, our speaker for the afternoon, because he could take over this presentation on really, really short notice, which I think is quite impressive. He will be giving two talks today, this one and the next one, and we are very grateful. So Felix, not Andreas, sorry. He's a lead scientist at Bosch, and he's interested in safe RL and in the sample efficiency of reinforcement learning algorithms. And luckily for us, he's also interested in model-based reinforcement learning. So let's welcome our speaker. Thank you very much, Felix. So yeah, instead you're stuck with me for the next three-ish hours. You get a break in between, so you get to recover. As was mentioned in the introduction, I'm interested in sample-efficient RL, and I will spend the next hour and a half convincing you that model-based RL is the same thing as caring about sample-efficient reinforcement learning. You've probably seen slides like this a bunch of times by now: some agent interacting with the world. And this has been around for a while, right? Ever since these Nature papers from DeepMind, this has been extremely popular: just interact with some environment and you can learn really, really impressive policies. And one thing that essentially all of these require is some simulator. So reinforcement learning started out with this motivation: let's go out into the real world, let's have a real environment. And then it turns out that all our algorithms, all our research, is focused on simulators. And this is great, we can do crazy things there, but we really moved away from trying to do things in the real world. So how can we actually go back to the setting that we're motivated by, namely some real system that we can interact with? And one key component is data efficiency. The problem is, a lot of people propose to take some simulator and build ever, ever better simulators. But all of these simulators have the same problem: they are only approximations of the real world. And then you can go back and retrain. But really, at the core, we would like to have methods that can be applied directly on real systems, so that we can skip this entire phase of building simulators and ideally just go to the real system and it will work. That's kind of the holy grail of reinforcement learning. And to some extent, model-based RL is doing nothing other than that: it's learning the simulator. So what we're doing is we're taking this real-world system and we're replacing it by a learned simulator that's learned at the same time as the policy. And this is what allows us to be data efficient, as we'll see over the next hour and a half. And the key question in all of RL, as I said, is: how can we explore efficiently? How can we decrease the requirement for lots and lots of data that we need to collect on the real system and instead be really data efficient in learning our policies? And how can we guarantee safety? Thanks, Andreas, for the pointer; that's going to be the next tutorial. All right, just briefly on the notation. Again, you've seen these slides before. We have MDPs. There will be a little bit of POMDPs, but mostly we'll focus on MDPs for this talk. So we have states x, following the control notation.
We have actions a, following the reinforcement learning notation. We have an initial state distribution that we start from; we'll mostly look at the episodic setting. And then we have some environment, and this is usually denoted by some probability distribution: the probability of transitioning to a next state x' given the current state and some action that we get to choose based on our policy. And our goal is to maximize the reward. This can either be a discounted reward, where we have some discount factor and we care about the infinite-horizon setting, or it can be a finite-horizon task, with some finite horizon H. Interestingly, a lot of the real-world tasks that we train on are finite-horizon tasks, but we train them in the infinite-horizon setting. Really, either of them is fine. And you've already seen a lot of one particular approach, which is model-free reinforcement learning. Maybe this name is sometimes a little bit misleading, because you still have models in there, for example for the critic. But usually what is meant by model-free reinforcement learning are methods that use the data directly. So you have a replay buffer, which you can think of as a non-parametric world model: it's just data-based samples of transitions. You use these to learn a value function, for example, and then use that directly to optimize the policy. So it's model-free in the sense that we don't need to know the environment, or have an approximation of the environment; we use the data directly. And the other approach, which we're going to talk about today, takes the other perspective. If you think about MDPs, about this environment: what if we were to actually learn a model that represents it, use that model with another set of RL methods, so either directly planning or optimizing with this model, and then take the learned policy and plug it back into the real environment? The hope is that we can use a lot less data by using these learned models. Let me give you an illustration of how that looks in practice. This is the standard reinforcement learning loop, and now what we're doing is we're collecting all the data that we've gathered during the interaction with the environment, and we're going to do a rather intricate policy update step, where we start by doing model learning. So we're going to try to summarize this non-parametric replay-buffer model with a parametric model that we can then evaluate. And the key point, and the key difference to model-free approaches, is that we can actually query this model over multiple steps. So we can really generate off-policy data based on this parametric model that we learn. At that point, we essentially have a new simulator: unlike the environment, we can query this model at every state, we can keep predicting forward, and we can use that for policy optimization. There are different ways to do that, and we'll see some of them. But on a high level, you can treat this as a new reinforcement learning problem that no longer uses the real data, but uses the model data instead. And this policy you then plug back into your normal reinforcement learning loop, you collect another transition of real-world data, and you iterate.
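To make this loop concrete, here is a minimal sketch in Python. All the components (the environment step, the model fitting, the planner) are stand-ins with made-up names, not any particular library's API; the point is only the structure: collect a little real data, fit a model, optimize the policy against the model, repeat.

```python
# A minimal sketch of the generic model-based RL loop described above.
# Environment, model, and policy are toy stand-ins; names are hypothetical.
import numpy as np

def collect_episode(env_step, policy, x0, horizon):
    """Roll the current policy on the real system and return transitions."""
    data, x = [], x0
    for _ in range(horizon):
        a = policy(x)
        x_next = env_step(x, a)                 # one expensive real interaction
        data.append((x, a, x_next))
        x = x_next
    return data

def fit_model(replay_buffer):
    """Fit an approximate dynamics model x' ~ f(x, a) to all data so far.
    Here: a trivial least-squares linear model, purely for illustration."""
    X = np.array([np.concatenate([x, a]) for x, a, _ in replay_buffer])
    Y = np.array([x_next for _, _, x_next in replay_buffer])
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return lambda x, a: np.concatenate([x, a]) @ W

def plan_policy(model, reward):
    """Optimize a policy purely inside the learned model (cheap queries).
    Placeholder: greedy one-step action over a small candidate set."""
    candidates = [np.array([u]) for u in np.linspace(-1.0, 1.0, 21)]
    return lambda x: max(candidates, key=lambda a: reward(model(x, a), a))

# Main loop: very little real data per iteration, lots of model queries.
env_step = lambda x, a: 0.9 * x + a            # stand-in for the real system
reward = lambda x, a: -float(x @ x)            # drive the state to zero
policy = lambda x: np.array([0.0])             # initial policy
replay_buffer = []
for iteration in range(10):
    replay_buffer += collect_episode(env_step, policy, np.array([1.0]), horizon=20)
    model = fit_model(replay_buffer)           # "learn the simulator"
    policy = plan_policy(model, reward)        # plan against the learned model
```

In practice the model would be a neural network and the planner one of the methods discussed below, but the structure of the loop stays the same.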
So this is really the key insight of model-based RL: you collect very little data on the real system, and you build an abstraction, an approximate simulator, that you can interact with really, really cheaply. And once you have that, you can, for example, apply normal reinforcement learning methods to that simulator. The promise is that by having a simulator, sorry, by having a learned model, we can be much more data efficient, because we have gained much more access to the environment: before, we could only interact through samples, and that's expensive; now we have a learned model that we can interact with freely. All right, so what are the promises? As I mentioned, sample efficiency is one: little real-world data, build a model, generate a lot more data with the approximate model, and then run, if you want, standard algorithms on that model-generated data. You can go even further: maybe you don't even want to use the usual gradient estimators; if you have a model, you can actually differentiate directly through it. So you could do a rollout over multiple steps and compute an analytic gradient. That's something that could be interesting. And then, most interestingly, you can now start to think about uncertainty. You can also do this in value space, but it's a little bit more natural, I would say, in model space, and you can use it to guide exploration. We'll see some of this towards the end of the talk, and a lot more of it in the tutorial on safety. And the last thing is: the moment you start talking about models and uncertainty, it's also very natural to think about offline methods, because the setting where you have this learned model is not that different from the usual offline setting, where you generate a little bit of real-world data, it's fixed, you learn a model, you use this model to generate a new policy, and you plug it back in. So offline RL in the model-based setting is essentially just one loop of the diagram I drew before. So it's great, right? So many promises, why do we do anything else? Well, as always, there are trade-offs. The most critical one is that by learning a model, we only have an approximation of the real world. And this is a problem because, especially if you try to predict multiple time steps ahead into the future, these model errors compound. There was a paper, I think in 2019, that was one of the first model-based reinforcement learning papers in the deep reinforcement learning setting; it really showed that, hey, you can also use deep learning in combination with model-based methods. And what you see there is a prediction with the model next to the real movement. If you predict ahead with your model, you get confidence intervals that just get larger and larger over time. What this means is that your model predicts lots and lots of different futures, and the further you predict into the future, the more likely it is that very small errors in your model compound, and then you're predicting completely different state distributions from those you would see on the real environment. So this is the key challenge we somehow need to overcome. And the second thing is: if you have a really cheap simulator or a cheap environment, you probably do not want to do model-based reinforcement learning.
Because model-based reinforcement learning has this extra computational cost of fitting the model; if you have a simulator that's really, really cheap, then just use the simulator directly. I think that's a pretty good general rule of thumb. All right, a quick disclaimer: this is not going to be an exhaustive review of everything in model-based reinforcement learning. We'll focus specifically on, at one end, newer methods using neural networks, and at the other end, probabilistic methods: methods that explicitly learn about the uncertainty in your model and try to exploit it to be even more data efficient in exploration. So let's talk about policy optimization as the very first part. Let's assume for now that somebody has already given us the model. You've seen a little bit of how to learn models in the previous lecture on representation learning: if you remember Dreamer v3, that was building a world model based on a representation; it had this autoencoder, and then it was training a neural network to learn a world model. We'll see a bit more of that later, but for now let's assume somebody gave us the model. What could we do with it? Here's a very generic overview of how a model-based RL algorithm works. You start out with an initial policy, and then you keep iterating between collecting data and adding it to your replay buffer, planning a new policy based on the model that you just learned from this data, and rolling out the policy to generate new data. If you look at some implementations, they will do this a little differently; they will interleave model learning and optimization, but on a high level this is really how these algorithms work. And we'll talk about three things: the one we just started with is how we can plan, given a model; then we'll talk about how we can learn a model; and at the end, we'll talk a bit about how to do exploration, accounting for uncertainty in these models. Just as a warm-up, one setting where it is extremely well understood how to do this is the tabular case. In particular, there are these very classical papers that just learn models based on counts. What's really nice about the discrete, tabular setting is that everything is independent of each other. So for each state, you can just build estimators for your transition model based on how often transitions to the different discrete states have happened. You can build a first world model by looking at a particular state x_t and a particular action a. You look at all the instances where you've been in this state, that's the count in the denominator: how often was I in state x_t and applied action a. And then you look at which of those instances transitioned to the particular next state. That's a very, very simple first world model. And there's been a lot of work in this discrete setting on how to plan with these models and how to do exploration-exploitation. There are a bunch of references here; if you get the slides later, there's a whole list on the last slide that points to the papers. All right, so this is the simple case, and we'll try to push this all the way to neural networks and deep reinforcement learning.
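Before moving on, here is a minimal sketch of that count-based estimator for the tabular case. The class and its names are illustrative rather than taken from any particular paper; it simply implements the ratio of counts, P_hat(x' | x, a) = N(x, a, x') / N(x, a).

```python
# Count-based world model for the tabular case, as described above.
from collections import defaultdict

class CountModel:
    def __init__(self):
        self.joint = defaultdict(int)      # N(x, a, x'): observed transitions
        self.marginal = defaultdict(int)   # N(x, a): visits to (state, action)

    def update(self, x, a, x_next):
        self.joint[(x, a, x_next)] += 1
        self.marginal[(x, a)] += 1

    def prob(self, x, a, x_next):
        n = self.marginal[(x, a)]
        return self.joint[(x, a, x_next)] / n if n > 0 else 0.0

model = CountModel()
for x, a, x_next in [(0, 1, 1), (0, 1, 1), (0, 1, 2)]:   # toy transitions
    model.update(x, a, x_next)
print(model.prob(0, 1, 1))   # 2/3: estimated probability of reaching state 1
```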
And so, like I said, first planning, then we'll look at learning models, and then exploration. Good, planning. When we talk about planning, there's really a massive literature on, given a model, how to construct optimal actions, and it's too much to cover here. There are lots of different settings: do you have discrete actions or continuous actions? In the partially observed setting, things are quite different from the fully observed setting. You can add constraints, which is what we'll do in the next lecture. And you can think about either very simple linear transition models, which are easier to analyze, or nonlinear models, which become much more tricky. Here we will focus on the continuous case, so we'll not do discrete methods so much. I mentioned Dreamer; we're not going to do that Dreamer setting so much, we're going to look at other kinds of state-space models. We're going to focus mostly on the fully observed setting; we will see some things about partially observed settings, but mostly it will be fully observed. And we will not talk about constraints. As I said, there are two ways you could do this, and this goes back to what was said yesterday in the talk about primal-dual methods. Essentially, there's the primal perspective, where you directly use the model in order to plan a state distribution, and you optimize that state distribution directly in order to maximize the reward. This is essentially the same as the primal problem that was presented yesterday; not in the typical linear programming sense, but to some extent you're also planning a state distribution and trying to maximize reward. And then there's the dual perspective, which goes more towards the model-free setting, where you really use the model directly as a simulator: you only use it to generate data. This is what a lot of algorithms have done, and then they apply any standard algorithm; you could even apply an on-policy algorithm if you wanted to, but most algorithms so far have actually used off-policy methods. So: use your model to generate data, plug that into SAC, and you have a new model-based RL algorithm. We'll see a little bit of this later, but we will start out in the very model-centric case. And just to give a little shout-out to the control people: for linear models and quadratic rewards it's extremely well understood how to do this, and there are actually exact solutions in terms of the linear quadratic regulator. The control community typically takes a different approach, where they do system identification: they learn the entire world model first, and then they do optimal control. Whereas what we do in model-based reinforcement learning is we learn the model as we go, through interaction, which is a slightly different approach. But one thing that they share is model predictive control, and that's a key idea that was part of one of the first big model-based RL papers, namely the PETS paper. So here's the idea. Let's for now assume we have a deterministic model, so our next state is given by some function f that, given the current state and the current action, tells us where we end up. And our objective then becomes to maximize the sum of rewards.
Strictly speaking, this would be an expectation over the initial state, but let's say we start in one specific state. Then our goal is to maximize the sum of rewards, given an initial state, subject to the constraint that each next state is given by our dynamics model. And in particular, we're not going to fit a parametric policy, but we're going to directly optimize over the sequence of actions. This is a bit awkward in the reinforcement learning setting: we usually care about the infinite horizon, and that we cannot do. So typically what people do is they plan over a finite horizon; that's why it's often called receding horizon control, or model predictive control. You optimize over a finite sequence of actions, you apply only the first of these actions, and then you go one time step into the future and you replan. That's essentially the key idea behind this class of methods. You use the model directly to plan over the next couple of steps, so you have a plan of actions into the future, but you only apply the first action, because in the next state things might be different: there might be noise, you might be in a slightly different state. At that point you need to replan, and you repeat the procedure. So here's exactly this idea. You start at a particular time step t and you observe the current state of the environment. Then you fix a specific finite horizon H, you solve exactly this optimization problem, and you carry out only the first action. And you redo this at every step. This is the key idea behind these methods, and it really goes back to the 60s with linear models; now it has become popular in reinforcement learning too, with nonlinear models. Here's an illustration of how this works. Let's say we have this goal there on the right, and we have some constraints, for example a region we do not want to enter. What you do is you start out with a plan over a finite horizon, and we truncate here, so the optimization problem never sees this constraint. That's the cost we pay by only planning over a couple of steps. Then you carry out the first step, and suddenly, when you plan over the finite horizon again, you realize you can't go forward, but instead you have to go around the obstacle. So by replanning, we can account for these kinds of constraints. And if we keep iterating this procedure of carrying out an action and then replanning for the next couple of steps, eventually we manage to get to the goal. What's important here is that this is actually a dense reward: at each time, our agent gets to see the distance from the goal. And this is why planning over this finite horizon works here: the direction in which you want to travel is essentially given to the agent. So what's the challenge? In order to solve for the sequence of actions, we need to solve this optimization problem, and this requires us to repeatedly evaluate the model. We start out with an initial state x_t, we apply the first action, that goes through the model, and the resulting next state goes into the model again. So it's a concatenation of these model evaluations, which is also the reason why model errors compound: we evaluate the model once per time step in order to predict ahead.
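Written out, the receding-horizon problem just described looks roughly like this, a sketch in the notation used so far (deterministic dynamics f, planning horizon H, current state x_t given):

```latex
\max_{a_t, \dots, a_{t+H-1}} \; \sum_{k=t}^{t+H-1} r(x_k, a_k)
\qquad \text{s.t.} \quad x_{k+1} = f(x_k, a_k), \quad x_t \ \text{given.}
```

Only the first action a_t is executed; at time t+1 the problem is solved again from the newly observed state.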
And so what that means is that for each time step, we need to solve this optimization problem, where there's a full trajectory of states, given by the sequence of actions, and we maximize the rewards over this entire trajectory. So this is really a method that plans over entire trajectories of states rather than only individual states, which is a bit different from what you would normally see in reinforcement learning. And as I mentioned, if you have a neural network model, for example, in principle you can directly compute gradients of this repeatedly evaluated model. You predict ahead the full sequence of states and you do a backwards pass, so backprop through time through all the model evaluations, in order to get a gradient for how you should change your actions to get a better trajectory. Unfortunately, this is often quite expensive. Yeah, go ahead. I was just wondering, since you're planning only H steps ahead, is the discount factor necessary? No, it will be necessary later. This is the very first instance, and then we'll start adding value functions to account for this infinite horizon. For now we just truncate, and then it doesn't really matter so much. I mean, you can have it, but it's not so relevant. Okay, so you can analytically compute gradients with backprop through time. These methods actually work, but they do have some weaknesses. If you've ever tried to train recurrent models, then you know that this is essentially some form of recurrent model that you keep re-evaluating, and these models suffer from problems like exploding or vanishing gradients; it might also just not be easy to optimize the whole sequence of actions jointly. So what people have done in the deep reinforcement learning setting is rely on global optimization methods, black-box methods, evolutionary algorithms specifically, in order to optimize the sequence of actions instead. These are called shooting methods in the literature, because essentially what they do is propose a whole sequence of actions, see what the return of that sequence is, and then adapt the sequence of actions based on it. They kind of shoot into the future, figure out what's going to happen, and then try to adapt this plan of actions by proposing new candidates. Here's the key idea: these are randomized algorithms. In particular, one method that's usually used is the cross-entropy method (CEM), where you randomly sample all of the actions from a given distribution, you evaluate all of these samples, you pick only the best of the samples, and you repeat this process: you randomize around what you had before, select the best of those, and keep generating new hypotheses. So this is what it looks like. You start by proposing wild, random plans of actions, then you figure out which ones are actually the best under your current reward model, and based on those you generate new plans: you randomize locally around them and keep doing this until you find a really good plan that gets you all the way to the goal. And that is then what you actually execute. Yes? So when the sequence of actions is given by a neural network, is what you sample the weights of the neural network? So right now, we haven't talked about neural networks at all. These really are the actions.
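To make the shooting idea concrete, here is a minimal sketch of such a cross-entropy planner over an action sequence, given some learned model and reward function. All names and the toy one-dimensional dynamics are illustrative, not the PETS implementation.

```python
# Cross-entropy method (CEM) planner over a sequence of actions: sample plans,
# score them under the model, refit the sampling distribution to the best ones.
import numpy as np

def rollout_return(model, reward, x0, actions):
    """Score one candidate plan by rolling the model forward."""
    x, total = x0, 0.0
    for a in actions:
        total += reward(x, a)
        x = model(x, a)
    return total

def cem_plan(model, reward, x0, horizon, n_samples=200, n_elite=20, n_iters=5):
    mean = np.zeros(horizon)          # 1-D actions, for simplicity
    std = np.ones(horizon)
    for _ in range(n_iters):
        plans = mean + std * np.random.randn(n_samples, horizon)
        scores = [rollout_return(model, reward, x0, p) for p in plans]
        elite = plans[np.argsort(scores)[-n_elite:]]   # keep the best plans
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mean                        # in MPC, only mean[0] gets executed

model = lambda x, a: 0.9 * x + a       # toy deterministic dynamics
reward = lambda x, a: -(x - 1.0) ** 2  # drive the state towards 1
plan = cem_plan(model, reward, x0=0.0, horizon=10)
first_action = plan[0]                 # apply this, then re-plan at the next step
```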
And planning directly over the actions like this is also what the PETS paper does, for example. So this is how we act in the environment: we don't actually need a parametric policy here, but at every time step we optimize a new plan of actions. This is computationally expensive, of course: at every time step we actually have to run through this whole procedure of optimizing actions. There's another question in the back. Thank you. So if the sequence of actions actually depends on the states as well, does this also have to do the iterative evaluation of the model? What was the last part? If each action depends on the current state as well, which it can, then does the sampling of actions, according to some Gaussian process or something, depend on the sequence of states from the transition model as well? So, I mean, we're in an MDP setting: we start out with a first state, and that determines everything. But yes, the sequence of actions does depend on the state. And we will see in a second that there are smarter ways than just completely randomly sampling the actions. But this is a very first method that's completely free of any policy and just uses the model to directly plan a sequence of actions. This is very similar to the primal view in this linear programming setting, where we directly optimize over the distribution of states, but over a finite horizon. In the linear programming sense, it's actually the stationary distribution that we optimize; here, we're also optimizing over a distribution, just over a finite horizon. All right, and as I mentioned, it can be really inefficient to do this, and a lot of people have thought about how to do it more efficiently in this setting. One idea is to temporally correlate actions. The idea being that if you first go forward, then you don't want to go backwards in the next step; you want to keep going in some direction, and this can help exploration. Then you can keep a memory of previous samples. In particular, this is useful when you re-plan at a new step: you might want to reuse the partial plan into the future from the previous iteration in order to speed up the optimization. And then there are some tricks about how many samples you draw over time in order to become more efficient. These methods actually do work. Here are some examples where you use these methods to directly construct a plan of actions and solve concrete problems. It's computationally expensive, but given a model, this allows you to directly optimize for what you care about. And as I mentioned, there are some limits, and one of the earlier questions already pointed towards this, with the discount factor. We're planning over a finite horizon here. Let's say we're only planning three time steps ahead, but we actually have a sparse reward. Then for none of the states that we can reach here are we going to see any reward signal. In this case, planning would fail completely, because we do not account for the long-term effects of actions. We might make a plan here, but it has exactly the same reward as going there, because within three steps there's no way we will actually see the sparse reward.
And this is now where classical value-based approaches come in, because the point of a value function is essentially to tell you, for each state, what the cumulative sum of rewards from that state is. So what people have started to do is combine these purely model-based approaches, directly planning with the model, with value functions, but you only use the value function to bootstrap at the end of your sequence. You plan H steps into the future with your model, and then you rely on the value function only at the last state, to account for the bootstrapping that you ignored before. So now we're constructing a plan, but we're accounting for the long-term effects of finding ourselves in a particular state, by seeing what rewards we would get if we ended up in that state and kept acting from there. So this is trying to bring together this primal, model-based view with something like a value function. In particular, one interesting thing: if you actually set the horizon to 1, then what we're doing is directly maximizing the value of our critic, so of our Q function in this case. It's a value function combined with a model, asking: what would happen if I applied one action, and which action does my critic say is better? And this is essentially what happens in off-policy reinforcement learning algorithms. You have a critic that asks exactly this question for each transition that you sample: you get a transition sample, you ask the critic what would actually be the best action in this particular case, and then, in normal SAC for example, you would regress a policy based on this action a_t. So there are strong connections between these model-based methods and the model-free setting. And yes, there are lots of different ways to optimize this. You could, for example, try to exploit the fact that you have gradients and combine them with these global optimization methods, but this goes a little bit beyond what we have time for. So yes, it's a combination; this is one popular approach for model-based RL. If somebody gives you the model, you plan a sequence of actions, and you still bootstrap with a value function to account for the fact that you cannot plan over infinite horizons. Another question: how do we compute the value function? So the question is, how do we compute the value function. This would be done the same way as in model-free RL. I mean, there are people who try to actually use the models for this: for a particular state, they do a rollout with the model and use that as an estimate to fit the value function. But in most practical implementations, what you'll see is that this is done the same way as in model-free methods. With temporal difference learning? Yes, so temporal difference methods.
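One common way to write this value-bootstrapped planning objective, as a sketch under the assumptions used so far (deterministic dynamics f, planning horizon H, discount gamma, and a learned value function V):

```latex
\max_{a_0, \dots, a_{H-1}} \; \sum_{t=0}^{H-1} \gamma^{t}\, r(x_t, a_t) \;+\; \gamma^{H}\, V(x_H)
\qquad \text{s.t.} \quad x_{t+1} = f(x_t, a_t).
```

For H = 1 this reduces to maximizing r(x_0, a_0) + gamma V(x_1), that is, acting greedily with respect to a one-step model plus critic, which is exactly the connection to off-policy actor-critic methods mentioned above.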
All right. So far, we've talked about deterministic models. One thing that happens in practice is that you actually have stochasticity in your models and also in your environment. At that point, it's very difficult to make a very concrete plan, because there's a lot of randomness that can happen, and you need to account for it in expectation. And we've already seen the main trick for this: yesterday, Gary was talking about the reparameterization trick, and we can do exactly the same thing here. As Gary was saying, the key idea behind it is that you have a distribution, and you want to sample from it in such a way that you can actually backpropagate through the sample. This is a really important class of methods that essentially move the randomness out of the distribution's parameters and into a fixed noise source that gets passed through a deterministic function. For example, for a Gaussian, you have the mean and the covariance, and if you draw a standard normal sample, zero mean and unit variance, you can transform it into a sample with the corresponding mean and covariance, just by scaling it with a square root of your covariance matrix and adding the mean on top. So here's an example. Often what we have is Gaussian dynamics, where, conditioned on the state and action, we have a neural network that tells us the next state has some particular mean and some particular covariance matrix. And what's often done is that you don't model the covariance matrix directly, but you model its Cholesky factor, the Cholesky decomposition. At that point, you can sample the next state based on just a sample from a standard normal distribution. This is what I said: you take the mean and you add some random noise scaled by the Cholesky factor of your covariance matrix. And this is a way to draw samples from probabilistic models while still being able to backpropagate through them, with respect to the actions and also the model parameters if you want. All right, so now we are almost done here. This particular transition model that we have, we can now reparameterize by the trick we've just seen. If you use PyTorch, it's just constructing a distribution and calling rsample on it, and it will give you such a reparameterized sample. And you can now do exactly the same trick where, in addition, we draw noise samples as we propagate forward. This is one way to account for the randomness that we have in our transition models. And by drawing these samples, we can get unbiased estimates of this value function, which is exactly the objective that we're trying to optimize. So this allows us to get analytic gradients through the model and directly optimize it using the reparameterization trick.
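Here is a minimal sketch of such a reparameterized rollout, assuming PyTorch. The mean and Cholesky "heads" are toy stand-ins for a learned dynamics network; the only point is to show where the gradient flows.

```python
# Reparameterized rollout through a Gaussian dynamics model: the value of a
# plan becomes differentiable with respect to the actions (and, if the heads
# were real networks, the model parameters). All names are illustrative.
import torch

def rollout_value(x0, actions, mean_net, scale_tril_net, reward):
    x, total = x0, 0.0
    for a in actions:
        mu = mean_net(x, a)                    # predicted mean of the next state
        L = scale_tril_net(x, a)               # Cholesky factor of the covariance
        dist = torch.distributions.MultivariateNormal(mu, scale_tril=L)
        x = dist.rsample()                     # x = mu + L @ eps, eps ~ N(0, I)
        total = total + reward(x, a)
    return total

dim = 2
mean_net = lambda x, a: 0.9 * x + a                   # toy learned mean head
scale_tril_net = lambda x, a: 0.05 * torch.eye(dim)   # toy learned Cholesky head
reward = lambda x, a: -(x ** 2).sum()                 # drive the state to zero

actions = torch.zeros(5, dim, requires_grad=True)     # a plan of 5 actions
value = rollout_value(torch.ones(dim), actions, mean_net, scale_tril_net, reward)
value.backward()                                      # backprop through time, through the model
print(actions.grad.shape)                             # a gradient for every planned action
```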
All right, so there was already a question in the beginning: why are we optimizing over sequences of actions at all? This is one particular way to optimize, and it actually works really well, because you're combining the best of both worlds: you're using the model to plan a little bit, and then you're bootstrapping with the value function. So you get some agility from using the model to plan, and even if your value function is slightly wrong, as long as it roughly captures the long-term behavior, it's good enough for this particular method. But of course, you can also fit parametric policies, and there are classes of algorithms that do this. The key idea there is that it's some form of regression problem. For example, if you look at SAC, at its core what it's doing is regressing the policy onto a proposal distribution of what an optimal distribution would look like under your current critic. And you can do exactly the same thing in model-based settings. The key idea is that you're trying to amortize this really expensive planning problem into a policy. And there are lots of people who look at this from exactly this amortization perspective, where you construct a plan and you directly regress onto this plan. But you can also view model-free methods as just a one-step-horizon version of this planning. OK, so this concludes the part on how to plan. Like I said, there are two main approaches: one is directly planning the sequence of actions, the other is regressing a policy, and there's a whole spectrum in between, where you can combine, for example, H-step predictions under the model with a value function. There are other ways too; I don't want to talk about them in detail, because we would run out of time. So let's instead talk about model learning. So far, we've seen that, given a model, there are lots of different ways to generate an optimal plan. We can either use the model directly, predict ahead into the future, and get an actual sequence of actions, or we can try to find a policy by just using the model as a simulator. So far, we've assumed the model is given, and now we actually have to learn it on the fly. Let's look at how people do this in practice. All right, so this is again the diagram from the beginning: we get our current data, we try to learn a model, and then we do planning under this model. The planning part we just covered, so now we're going to look at how to learn the model. These are usually called world models, because they're essentially trying to describe your simulator, your entire world, your environment, and to tell you exactly what will happen if you apply a particular action. There are methods to do this in POMDPs; they're a little bit more involved, so for now we'll focus on the fully observed setting, the MDP setting, where you directly get states, and I'll give some pointers at the end for how to do this in POMDPs. Essentially, it's very similar ideas. OK, so what's the idea here? What's the key observation that allows us to learn a model at all? The key observation is that the replay buffer already contains some form of non-parametric environment model: you have states and actions, and then you have samples of the next state. So it's a particular non-parametric buffer that you could sample from in order to get samples of next states. The key difference is that we cannot plug in new actions, because it's just fixed data samples. And the world model is trying to exploit the fact that we have this temporal coherence in our data. We have a sequence of states and actions, and we can use the fact that we are acting in MDPs: given the state, we know everything about our environment, and that means there's a particular conditional independence structure that we can exploit when fitting a probabilistic model. In particular, if you try to write down the distribution over all states, given all actions, it has this particular structure, where we have the distribution over initial states, and then, if we go T time steps into the future (this used to be H, now it's T), at each time, given the particular action and the particular state, we predict one step into the future.
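In formulas, that conditional independence structure is (a sketch in the notation used so far):

```latex
p(x_{0:T} \mid a_{0:T-1}) \;=\; p(x_0) \prod_{t=0}^{T-1} p(x_{t+1} \mid x_t, a_t),
```

so every observed transition (x_t, a_t, x_{t+1}) in the replay buffer is a sample from one factor of this product.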
So what we saw in the beginning, where given a deterministic model we can just predict one time step ahead, this is the same view from the probabilistic side, where we're writing down the probability distribution over an entire sequence of states, given the actions, and we have this particular conditional independence structure. And what's really nice is that when we get a trajectory, we get a sample from exactly this distribution. So what this means is that we are reducing our model-learning problem to essentially a regression problem: given the current state and action, we want to regress onto the next state. And for that, we use the fact that each transition, state, action, reward, next state, is essentially one labeled data point for the world model. It tells us: here was a sampled transition from a given state and action to the next state, and I want to fit my parametric model exactly to that data. In particular, as before, we focus on probabilistic models where the next state is sampled from this particular model f. OK, so how do these models actually look in practice? We already looked at the reparameterization trick and these conditional Gaussian dynamics models. In practice, what people do (and most methods that are not dealing with POMDPs do it exactly this way) is that you have a normal distribution over next states, where you have a neural network, parameterized by parameters theta, and this neural network outputs two things: a mean and a covariance matrix. And in particular, as with the reparameterization trick, people don't actually represent this covariance matrix directly, but they represent its Cholesky factor. This has some advantages: you have fewer parameters, and you can ensure that the matrix your neural network outputs is positive semi-definite, so it's a valid covariance matrix. And the other point is that if you want to do the reparameterization, it's much easier if you learn a neural network like this. How do we deal with multimodal dynamics in this case? So this model will not. It very much says that your environment is unimodal: for each particular state, my next state is a Gaussian distribution around some mean prediction. People have tried to extend this; you can look at mixture distributions and all kinds of fun models. But at least for the typical benchmarks that we have, things like MuJoCo, this is actually more than enough. In particular, MuJoCo is just a deterministic simulator, so in principle you could even skip the noise. OK, but now we're getting somewhere. We have a neural network that actually predicts the next state, and we have parameterized it in such a way that it models a particular distribution. And now that we have this probabilistic model, and we already had the equation for the distribution over trajectories, we can construct a maximum likelihood estimate for these model parameters theta. As I said, mean and covariance are parameterized by a neural network that outputs the individual components of the mean and the individual entries of the Cholesky decomposition of your covariance matrix. And what we're going to do is fit these parameters to maximize the probability of generating the data set.
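As a concrete sketch, here is what such a conditional Gaussian dynamics model and its negative log-likelihood loss might look like, assuming PyTorch. The architecture, the diagonal covariance, and all names are illustrative rather than the exact model of any particular paper.

```python
# Conditional Gaussian dynamics model p(x' | x, a) with a learned mean and a
# learned (here diagonal) Cholesky factor, trained by maximum likelihood.
import torch
import torch.nn as nn

class GaussianDynamics(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.mean_head = nn.Linear(hidden, state_dim)
        # Predict the diagonal of the Cholesky factor (log-scale keeps it positive);
        # a full lower-triangular factor works the same way, with more outputs.
        self.log_std_head = nn.Linear(hidden, state_dim)

    def dist(self, x, a):
        h = self.body(torch.cat([x, a], dim=-1))
        return torch.distributions.Normal(self.mean_head(h), self.log_std_head(h).exp())

state_dim, action_dim = 3, 1
model = GaussianDynamics(state_dim, action_dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One SGD step on a toy batch of transitions (x, a, x_next) from the replay buffer:
x, a, x_next = torch.randn(32, state_dim), torch.randn(32, action_dim), torch.randn(32, state_dim)
loss = -model.dist(x, a).log_prob(x_next).sum(dim=-1).mean()   # negative log-likelihood
opt.zero_grad(); loss.backward(); opt.step()
```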
So this is a typical maximum likelihood estimate. If you actually write it down, a MAP estimate just means that theta is now all the weights in our neural network, and what we're trying to maximize is exactly this probability: the probability of generating a particular trajectory under the model that we saw before. And then we have a prior distribution over the parameters, though most people will just set this to a flat prior and do plain maximum likelihood. What you get out is a minimization problem that looks at each transition, plugs it into the neural network, gets a mean and a covariance matrix, and then maximizes the log-likelihood of the next state that you observed in your replay buffer, so from your data. This is how we can start to learn world models just through standard inference, as you've probably seen before. And yes, you can optimize this with SGD, because we are in deep learning. Here are some examples of doing exactly this. This is a robotics system, and this is a small toy car. What they did in this paper is they took the data, they learned this world model, and then they just started to plan and optimize policies for these systems. And this already allows you to do pretty impressive things. You have to put real effort into learning a really good model, but the moment you have it, you can actually execute pretty precise maneuvers, for example this braking here, which would be pretty difficult to do if you didn't have a really, really good approximation. So just having a model and optimizing against it can already get you really, really far, at least in these controlled environments where you can collect a lot of data. And now comes the problem. The previous example was cool, but a lot of effort went into learning that particular model. One key challenge when you start to use these methods is what I said in the beginning, namely that your learned simulator, your world model, is only an approximation of the real world. And that means that if we really optimize against our model directly, we might be optimizing for things that are completely physically implausible, because when I predict ahead over time, my model errors compound and I might be optimizing for the wrong things. That's fine if we're replanning, but it's not fine if we're optimizing a policy and then just deploying it on the real system. And there is essentially one key answer to these kinds of problems: be aware of this model uncertainty. In particular, what methods do is they try to capture the uncertainty that comes from not having seen enough data. In the example I showed in the video, they collected a lot of data, they fitted a really good model, and they could optimize against it. What we do in model-based reinforcement learning is we start with very, very little data, and then we start learning a model on that, and that model will only be good locally, around the data we've seen. If we use it to predict far away from the data, we will get arbitrary predictions, and we need to be aware of that. Because if we are aware of it, we can actually exploit it to do more efficient reinforcement learning. And so one option here is Bayesian learning. Okay, I'm not going to go through all of these equations again.
But essentially, what we did before is we just constructed a MAP estimate of the parameters. In principle, you could try to model the entire distribution over parameters, and that would give you a very natural estimate of uncertainty. In particular, what typical methods do is that you will actually have samples theta_i from this distribution. For example, if you do MCMC methods, you get samples from the distribution over theta, and then you can approximate the expectation, or approximate the distribution over the next state, by a weighted sum over these samples. Bayesian learning has been a field for a long time, and it's getting better all the time, but what people actually do in practice is not the full Bayesian treatment; they do approximations to it. There are particular methods, and I think there will be links to the papers later, that try, for example, to learn the distribution over parameters locally, through gradient descent. But what is even more popular than actually drawing samples in a Bayesian way is to use an ensemble: you initialize a bunch of random neural networks, say 10 different world models, you train them on the data, and that is also a representation of uncertainty that's really popular. All right, how are we doing for time? I have about 20 minutes left, is that correct? Half an hour. Half an hour, okay; we did start a little bit later. All right, so let me not talk about this, but one thing that's really important when you talk about these models is the difference between epistemic and aleatoric uncertainty. Here's an illustration. There are two kinds of uncertainty in model-based reinforcement learning, whereas we only have one of them in the real environment. Aleatoric uncertainty is what we call the uncertainty in the environment that comes from noise. Typically, when you take the expectation over returns in normal reinforcement learning, it's exactly over this aleatoric uncertainty: you're averaging out the noise that comes from the environment, and potentially also the noise that comes from your policy. What's important about this kind of uncertainty is that it's non-repeatable: every time you roll out on the real system, you will see a different realization of that noise. This is what we call aleatoric uncertainty. So if you look here, there's some true system, this blue line, and then we see data samples of what the next state might be, for a one-dimensional state-action pair. This is exactly what I mean by aleatoric uncertainty: it's randomness that is resolved just by drawing samples, randomness in the environment. And the moment we start building models, what we actually get is a distribution over these particular models f. Here you were supposed to see a model, but what you see is just the mean function of this model, and it's an approximation of the true system f. And this model contains two things. The first is uncertainty about the true system function f itself; that's what we call epistemic uncertainty, and it comes only from this distribution over functions conditioned on the data. This uncertainty will go to zero as we get an infinite amount of diverse data.
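In code, this ensemble view of the two kinds of uncertainty might look roughly like this, assuming PyTorch; the member architecture and the use of disagreement as the epistemic estimate are a simplified, illustrative sketch, not a specific published recipe.

```python
# An ensemble as a practical stand-in for posterior samples over world models.
# Disagreement between members ~ epistemic uncertainty; each member's own
# predicted variance ~ aleatoric uncertainty. All names are illustrative.
import torch
import torch.nn as nn

def make_member(state_dim, action_dim, hidden=64):
    # Each member outputs [mean, log_std] of the next state.
    return nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, 2 * state_dim))

state_dim, action_dim, n_members = 3, 1, 5
ensemble = [make_member(state_dim, action_dim) for _ in range(n_members)]
# (Training: each member is fit by maximum likelihood as before, typically from
#  a different random initialization and on a different bootstrap resample of
#  the replay buffer, which is what keeps the members diverse.)

x, a = torch.randn(state_dim), torch.randn(action_dim)
outs = torch.stack([m(torch.cat([x, a])) for m in ensemble])   # (n_members, 2 * state_dim)
means, log_stds = outs[:, :state_dim], outs[:, state_dim:]

epistemic = means.std(dim=0)             # members disagree: little data seen here
aleatoric = log_stds.exp().mean(dim=0)   # average predicted noise of the environment
```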
Again, the idea is that by only seeing very few transitions on the real environment, we're not able to conclude perfectly what the true world model is; there's some residual uncertainty. And as we collect more and more data, that uncertainty will decrease. In contrast, the aleatoric uncertainty will always be present. So the aleatoric uncertainty is what we just modeled with the Gaussian; the epistemic uncertainty is this distribution over possible world models that we were approximating with samples. This is a really important difference, because one of them we average over in reinforcement learning, and the other one we need to somehow account for, because it's new in our model-based reinforcement learning setting. All right, and here's one particular example. As I said, people like to build these ensemble models, where you have lots of different parameters theta_i, and for each of these parameters you have a Gaussian distribution over the next state. Just to draw the bridge here: in this model, the index i is where the epistemic uncertainty lives; it's saying that all of these models could explain my data. And the normal distribution, given a particular parameter theta_i, is the aleatoric uncertainty. So the normal distribution models the noise in my environment, and the different parameter samples theta_i model the epistemic uncertainty: what's going to happen in the future is uncertain, given the amount of data I've seen. Okay, so in the interest of time, I'm going to skip some things and jump directly to the PETS algorithm, which is one particular method; it was actually the first paper that really popularized deep learning together with model-based reinforcement learning. And they did exactly what we've seen so far. They took an ensemble: they initialized, I think, seven or so different neural networks, and they trained them on the data, on the replay buffer, in order to get a distribution over models, essentially. So the seven different neural networks that they initialized and trained served as samples from the posterior, representing possible different dynamics models. And what they then did is planning with these models. In particular, at each time step they would use exactly the CEM method to plan a sequence of actions and only carry out the first one. And then they would iterate this loop the same way we've seen before: given new data, they would retrain the model, and the model they would use directly for planning in order to act in the environment. And this was actually the first paper that showed that model-based methods can at least be competitive with normal actor-critic methods, and at some points even outperform them. So here is one particular result; PETS is this blue line. For example, in this Reacher task here, they actually perform really, really well, in particular if you compare this to methods like SAC, where the dashed line here is SAC at convergence. In some cases it can actually outperform these methods, and one key reason why it can do so is through this planning.
So this expensive re-planning step buys you a lot of flexibility, to squeeze out the last bit of performance that might be difficult to represent in a parametric policy, but that you can get if you go this non-parametric way, where you re-plan with the model. So this was the first paper that showed that model-based RL actually has some value of its own. And, as I mentioned, here are the pointers to POMDPs. This has been extended to POMDPs, Dreamer being one instance, though not one that models uncertainty; but some papers have looked at actually modeling uncertainty in POMDPs too, and they've been surprisingly successful. I recommend looking at all of these papers; they give you some idea of how one can start to build more complex world models also in POMDPs. And the slides will be on the website later. All right, so let's talk about exploration. So far, we've seen how we can learn world models, and, given a world model, how we can plan actions in order to optimize expected performance. As I said, one thing that's really different about model-based reinforcement learning is that we suddenly have this notion of epistemic uncertainty: we have not collected enough data to represent the model perfectly. And that means that the model is somewhat prone to being over-exploited. If we really go far off-policy under the model, so we've collected data under some policy, we fit a model, and then we try to plan an action sequence that's completely different from the data, this would often not work so well. One thing that people have done in the past is to more or less ignore that fact and just average over this uncertainty. So you would often see the expectation over all of these world models, trying to find a policy that solves all of these models in expectation. And this part of the talk is going to be about how to do something a little bit better, which allows you to be even more data efficient, and also to be better at solving sparse and challenging exploration tasks. All right, the exploration-exploitation dilemma is something that, I would hope, you have seen quite a bit about in the first half of this RL summer school. There are a lot of remedies for it, and typically they revolve around adding some amount of randomness, but there are also actually principled exploration schemes. So we'll take a very short excursion back to bandits, to give you an idea of why just ignoring epistemic uncertainty is a bad idea, and then we'll spend the last 10 minutes or so talking about how one can find good solutions to this in the model-based reinforcement learning setting as well. All right, so, the bandit setting. Suddenly we've simplified our reinforcement learning setting: we're doing a single step, we just have one state, essentially, and we only get to pick the actions. What's happening here is that we've built a probabilistic model over some underlying function; that's one way you can solve these bandit problems. And we want to use this uncertainty in order to find the maximum of this function. So the question is: how should we sequentially query these particular arms, or these actions, in order to find the maximum of this function?
Okay, I'm not going to talk about these particular Gaussian process models, but one thing one could try to do is directly ignore the epistemic uncertainty that we have about the function and just optimize for the expected value: under my probabilistic model of the function, what's the expected return that I could get? If you run this algorithm in the bandit setting, what you see is that you get stuck in local optima extremely quickly. Essentially, the first evaluation was bad, so it never evaluates there again. At the second evaluation, you got a positive return that was slightly better than the mean, and now, even though there's all this epistemic uncertainty, it will never evaluate anything else again. The problem here is that we're treating epistemic uncertainty, namely uncertainty about the true underlying function, the same way as aleatoric uncertainty. And this is also what has happened quite a lot in model-based reinforcement learning, where you just average over all your possible world models and try to optimize expected return under all of these models. You can theoretically show that this has exactly the same problem: the bandit setting is a particular instance of a reinforcement learning setting, so from that you can conclude that anything that does not work in the bandit setting cannot have guarantees in the reinforcement learning setting. So, as I said, methods that ignore epistemic uncertainty, or just average over epistemic uncertainty, will get stuck in local optima. Luckily, and you will probably have seen this, there are algorithms from the bandit literature that can actually solve these problems properly. One of them is optimistic exploration. I'm not going to go through it; you've seen UCB, presumably. The idea being: figure out where things could be good under your epistemic uncertainty and focus your evaluations there. In particular, one method is to pick the maximum of the upper confidence bound. Another method would be to randomize, by sampling functions and then picking the maximum of each sample; that would be Thompson sampling. And I'm sure Claire gave a really good illustration of this. Didn't you do the bandit tutorial? Yes, I did. Ah, okay. Okay, back to model-based reinforcement learning. As I said, a lot of principled exploration algorithms in reinforcement learning take inspiration from the bandit setting. The bandit setting is what we understood first, and it turns out that a lot of these algorithms are more difficult to analyze in the reinforcement learning setting but carry similar guarantees. One of them is Thompson sampling. In particular, the idea there is that we have a distribution over possible models, we sample one of our world models, and then we try to find a policy that maximizes performance given this particular sample. And you can actually give theoretical regret bounds showing that these algorithms work. Optimism, so the equivalent of UCB, is a lot more difficult to do in the reinforcement learning setting. The problem is that in optimistic reinforcement learning, what you would have to do is maximize over this function class of models: what is the model that achieves the best possible performance? And this is actually really difficult to optimize directly in model space. So here's one particular method that tries to work around this in a clever way.
In particular, rather than optimizing over the model class directly, over all possible models, what we'll try to say is: all of our possible next states under the models lie somewhere within the mean predictions of the models plus some notion of epistemic uncertainty around them. Let me jump to the illustration. Here's the key idea. Let's say we have some initial state and we want to predict forwards with the model. We have some policy that tells us which action to take, we plug that into our model, and what we get out is a mean prediction of where we're going to end up; that's this blue arrow. Then, after one step, we can look at the epistemic uncertainty in our model. For example, we might have an ensemble of models, and they will tell us the possible next states I could end up in, given that I haven't seen enough data, and this gives us this gray region here. The key idea for optimizing over this model class, that is, also over the epistemic uncertainty, is to use the reparameterization trick in a clever way. Namely, we define an additional policy, eta, that is free to act within the epistemic uncertainty. So you still have your policy pi that acts only on the mean, and then you have this other policy that only exists in model space, right? It only exists because we have epistemic uncertainty, and it's essentially allowed to move freely within this epistemic uncertainty bound. So we now have two policies, and whatever this orange policy eta picks, that's what we declare to be the next state. Essentially, we've designed a new policy that's free to move within the epistemic uncertainty. And we can repeat this: based on the new state, we again select an action with our policy pi, we end up somewhere according to the mean, and again eta gets to act within the epistemic uncertainty. What's really cool about this is that even if I have a sparse reward, which is very, very difficult to find, if I allow eta to control the epistemic uncertainty in this way, I can now plan sequences of actions that actually get me to the sparse reward. So unlike in the setting where we took the expectation over the models, or just pretended that everything was extremely stochastic, here we are accounting for the fact that epistemic uncertainty is different, and we're allowing our policy eta to steer the epistemic uncertainty in our favor. It's a particular instance of the reparameterization trick where we also optimize over the way we do the reparameterization. And this is really nice because, if you optimize jointly over the policy pi and this policy that is allowed to move arbitrarily within the epistemic uncertainty, you can actually show that this is the same thing as doing UCB. You're being optimistic about your dynamics model, because whenever you see epistemic uncertainty in your model, your policy will say, well, this epistemic uncertainty will always work out in my favor. And this is essentially the key idea behind how you can do optimistic methods in a very practical way, because this is now just a standard RL problem under an augmented MDP.
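Here's a small sketch of how I'd read that idea in code; this is an assumption about a typical implementation, not the authors' code. The policy pi picks the real action, the extra policy eta picks a point inside the epistemic confidence region, and rolling this out gives the optimistic return that both policies are trained to maximize. The ensemble, the two policies, the reward function and beta are all placeholders.

```python
# Sketch of a "hallucinated" optimistic rollout; models, pi, eta_policy, reward_fn
# and beta are assumed placeholders supplied by the user.
import numpy as np

def ensemble_stats(models, x, a):
    """Mean and per-dimension epistemic std of the next state from a model ensemble."""
    preds = np.stack([m(x, a) for m in models])    # shape (n_models, state_dim)
    return preds.mean(axis=0), preds.std(axis=0)

def hallucinated_step(models, x, pi, eta_policy, beta=1.0):
    a = pi(x)                                      # real action, acts on the mean model
    eta = np.clip(eta_policy(x), -1.0, 1.0)        # fake action, one entry per state dimension
    mu, sigma = ensemble_stats(models, x, a)
    # eta may pick any point inside the beta-scaled confidence region: optimism about the model
    return mu + beta * sigma * eta, a

def optimistic_return(models, x0, pi, eta_policy, reward_fn, horizon, beta=1.0):
    x, total = x0, 0.0
    for _ in range(horizon):
        x, a = hallucinated_step(models, x, pi, eta_policy, beta)
        total += reward_fn(x, a)
    return total   # maximising this jointly over pi and eta_policy gives the UCB-style objective
```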
So we've added additional fake actions to our model, these actions that come from eta, and we've defined a new model that doesn't have epistemic uncertainty anymore; it just transitions according to the mean, and then eta lets us move freely anywhere within our epistemic uncertainty. By optimizing over both policies jointly, the resulting policy pi will be optimistic in the sense that it assumes that any residual epistemic uncertainty in our model will always work out in our favor. So it's an extended action space, we have more actions for each state, but it's a simple way to do optimism that still allows you to use standard reinforcement learning algorithms. Maybe a naive question, but how would this differ from just picking the model that is the most optimistic, in a sense, from your ensemble? So the question is, what would an optimistic model be? How would you define it? If you couple it with a value function? Yes, we will actually see a little bit of that in the next tutorial. There are methods that try to map the model distribution to the corresponding value distribution, and you can act optimistically with respect to that. But it's essentially this primal-dual thing again, right? Anything you do in the primal setting, you can do slightly differently in the dual setting. This is the way to do it in the primal setting. Before, the problem was that you'd be directly optimizing over the function space: in principle, instead of the maximization over actions, you would have a maximization over all possible models in your model class, which is tractable for certain model classes but not in general. This is a nice trick that allows you to do it in a very practical way while still retaining guarantees. The guarantees get a little bit worse, because it's an over-approximation, but it's nice in the sense that it's a really practical algorithm that's really easy to implement. So I guess the message is that this is a tool you can just use: you can be optimistic by applying the reparameterization trick to your epistemic uncertainty and optimizing over that reparameterization. A question: so these mu hat t-1 and sigma hat t-1 terms are the estimates from the already seen data, right? The estimates of the Gaussian dynamics from the data we've already seen? So just to be clear, your question is about the mu hat t-1 and sigma hat t-1 terms, not the x tilde. Yes, the notation here is a little bit confusing, but this is reparameterizing the epistemic uncertainty, and that uncertainty is indeed estimated from the data seen so far, because we don't know it exactly. And can we vary the beta t-1, like a sort of learning rate? Yes, beta is what allows you to construct the confidence intervals. It's a scaling parameter: the bigger you pick it, the more uncertainty you allow for, so the more exploration you will do, but the more likely you are to actually find the optimum.
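Just to make the "extended action space, standard RL algorithm" point concrete, here's a rough sketch of wrapping the learned model as an ordinary environment. The gymnasium interface, the zero initial state and the ensemble call are my own illustrative assumptions, not the method's actual implementation.

```python
# Sketch of the augmented MDP as a standard environment; everything model-specific
# (the ensemble, reward_fn, dimensions, horizon) is an assumed placeholder.
import numpy as np
import gymnasium as gym

class OptimisticModelEnv(gym.Env):
    """Learned-model MDP whose action space is extended with the hallucination action eta."""

    def __init__(self, models, reward_fn, state_dim, action_dim, beta=1.0, horizon=200):
        self.models, self.reward_fn = models, reward_fn
        self.beta, self.horizon = beta, horizon
        self.state_dim, self.action_dim = state_dim, action_dim
        # real action concatenated with eta (one eta entry per state dimension)
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(action_dim + state_dim,))
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(state_dim,))

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        self.x = np.zeros(self.state_dim)           # placeholder initial-state distribution
        return self.x, {}

    def step(self, action):
        a, eta = action[: self.action_dim], action[self.action_dim:]
        preds = np.stack([m(self.x, a) for m in self.models])
        mu, sigma = preds.mean(axis=0), preds.std(axis=0)
        self.x = mu + self.beta * sigma * eta       # eta picks a state inside the confidence region
        self.t += 1
        reward = self.reward_fn(self.x, a)
        return self.x, reward, False, self.t >= self.horizon, {}
```

Roughly, the idea would then be to train a standard algorithm on such a wrapped model and keep only the real-action part of the learned policy for the real system.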
So in this case, does it give you some sort of guarantee on finding the sparse reward, if you pick an appropriate schedule for the betas? Yes, there are guarantees for this, and they depend on two things. You need to restrict the model class; it needs to be a model class that's tractable to analyze. In particular, if it's a Gaussian process model class, or, if you want to be precise, RKHS functions, then you can construct a particular sequence of betas that allows you to give guarantees for exploration. And if you look at these guarantees, the numbers essentially just tell you that it converges. It's not the tightest algorithm; it's trying to be practical while still retaining guarantees. That was the goal here. Here you are considering the marginal covariances, but if you are really applying UCB, you should consider the joint distribution. What do you lose by doing this? So like I said, there are ways to get tighter guarantees. This algorithm wasn't trying to have the tightest guarantees; it was trying to be something you can actually implement in practice and that works. It was trying to walk this gap between theory and practice. So yes, essentially what we're saying is that at each time step I can look at the epistemic uncertainty independently: in principle, at this time step I can go up, at that time step I can go down. And you do lose out by ignoring the correlations, at least on the theory side. In practice, the models that give us the epistemic uncertainty are just going to be ensembles of neural networks, so it's very natural to construct marginal distributions from them, but it's a bit more tricky to really account for the correlations between time steps. So from a theoretical side, you lose out; from a practical side, this is empirically probably good enough. But as always, if you can decrease the size of these epistemic uncertainty bounds, you would do better, because there's less for you to explore. Like I said, this was trying to be practical while still having guarantees; you can do better from the theory side, but then you also need to find an algorithm you can actually implement. All right, so here's an illustration of this before we wrap up in a few minutes. It's an inverted pendulum and we want to do a swing-up: it starts in the bottom position, and eventually you want to bring the pendulum to the top position. As I mentioned, we're planning these trajectories in an optimistic way, assuming that any epistemic uncertainty works out in our favor. You can see this in the top right, where the angle is on the horizontal axis and the angular velocity on the vertical axis. What you can see is that, given that our model is really uncertain, there's a lot of epistemic uncertainty, and what we're planning is a path that goes directly towards the goal, this green star, and the plan violates every known law of physics, because it's changing the angle without any angular velocity. Which is great, but I don't know how to move a pendulum without creating angular velocity.
So this is a very optimistic trajectory, and if you roll it out on the real system, understandably nothing much happens: you're more or less randomly actuating the system, and your actions don't move the pendulum the way you thought, because there was epistemic uncertainty, you didn't know enough yet. But what's really nice about these optimistic methods is that in that process they collect interesting data. With that, we learn a new model and construct a new plan, and you can see that this plan is starting to look a little bit more like the real world: suddenly we are creating angular velocity, and we're still reaching the goal in the optimistic plan, while in the real world it still doesn't work. If you keep iterating, it gets better and better, until eventually, after seven episodes, it is actually able to swing up the pendulum as you would expect. And this is really nice because it's a sparse-reward task, which is genuinely difficult to solve, and we'll see that classical methods can really struggle with this. So this is the last thing I want to show; your lunch break is in like five minutes, so bear with me. Here's a standard half-cheetah task, and you can see the normal greedy method. By greedy, I mean these classical model-based RL methods that just average over epistemic uncertainty. And if you do Thompson sampling with these kinds of model classes, it performs just about the same as the greedy methods. But if you do this optimistic thing, you already gain a little bit of performance. The reason that Thompson sampling is not working so well here is that these ensembles are just not a good approximation of your posterior distribution. They're just individual samples; they're not rich enough to represent the full continuous uncertainty that you would need for Thompson sampling to really work, at least in our experience. And this is just the normal half-cheetah setting, so we're already gaining something here by being optimistic and treating epistemic uncertainty the right way. What's really interesting, and typically not done in reinforcement learning, is to introduce action costs. The reason this is not done is that it completely destroys your reinforcement learning algorithms in sparse-reward settings. Think about the entropy bonuses we add to our policies: the reason they're there is that we need some stochasticity in our actions in order to explore. Now an action penalty is telling me that every time I try a random action, my reward decreases. So suddenly your gradient is saying, in the sparse-reward setting, don't do anything at all, because any action you apply gives you a negative reward, and so far I haven't seen the sparse reward that I could eventually reach. So methods like SAC, your normal algorithms, struggle in sparse-reward settings with action costs, because if you've never seen the sparse reward, the only signal you get is "don't do anything", and most RL problems you cannot solve by doing nothing. So we're now going to increase this action penalty, and what you can see is that the greedy method starts to perform worse.
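To make the action-cost argument concrete, here's a toy reward of the kind being described; this is my own illustration, and the cost coefficient, the goal test and the tolerance are made-up numbers. Until the sparse bonus has been observed at least once, the only learning signal is the penalty, which pushes the policy towards doing nothing.

```python
import numpy as np

def sparse_reward_with_action_cost(state, action, goal, action_cost=0.1, tol=0.05):
    bonus = 1.0 if np.linalg.norm(state - goal) < tol else 0.0   # sparse success bonus
    penalty = action_cost * float(np.sum(np.square(action)))     # quadratic action cost
    # Before the goal region has ever been reached, every transition only contributes
    # -penalty, so a greedy or entropy-regularised learner is driven towards zero action.
    return bonus - penalty
```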
The optimistic method, on the other hand, for which you've seen these optimistic plans towards the sparse reward, still sees that under its epistemic uncertainty it can move arbitrarily through the state space and eventually find the sparse reward. The other methods conclude that it's a really stochastic system with lots of randomness that's really difficult to control, and they really struggle to find any return at all. If you increase the penalty even more, the overall return goes down, of course, because we've introduced a cost, so the maximum return can only decrease. But these optimistic methods still manage to get to the sparse rewards, while the classical methods struggle to explore sufficiently and will just conclude that the optimal policy is to do very, very little. All right, and here's roughly what happens with a small action penalty: the agents will actuate the system a lot. If you remember those PPO videos that were popular like a decade ago, you have these humanoids running around with their arms up in the air. This is exactly that effect: we don't penalize actions, because that makes our reinforcement learning problems easier. Now, if you actually start introducing penalties, you're forcing your robot to be much more well behaved, to apply a lot less actuation. So you avoid really aggressive control actions like that robot over there, but suddenly your exploration problem has become a lot harder, because your reward signal is actively telling you not to explore: any kind of random exploration gives you a reward penalty. And in these kinds of settings, methods that treat epistemic uncertainty the right way and do principled exploration are actually still able to solve the problem. And with that, I'm 10 minutes over time, which is not bad for slides I saw yesterday for the first time. There are lots of references towards the end, and there are really a lot of extensions you can think about. One is transfer and meta-learning: when you're learning models, it's very natural to think about transfer, so what happens if I have a multi-task setting or I want to transfer to different environments? You can use models for conservative policy optimization, for example when you have constraints and constrained MDPs. You can do exploration based on these methods. I have no idea what model-based causal bandits are, but maybe Andreas can tell you at the next summer school. And in the safe exploration tutorial I will be presenting my own slides, so it only gets better. There we will actually use some of these epistemic uncertainty estimates to make the problem even more complicated: we're not only going to have epistemic uncertainty and deal with that properly, but we're also going to have constraints, which add an interesting dimension to this exploration task, as we'll see. And with that, I'll be around for lunch. If you have questions, I guess right now we probably don't have that much time, but thanks everyone.