OK, so welcome back everyone to the third lecture today. And once again we'll be having Felix, who will this time give a talk on safe reinforcement learning. Please, let's welcome him. All right, welcome back. Thanks for coming back. So this time we'll take the model-based RL setting, and we'll try to push it a little bit further by adding safety constraints. If you look at this RL diagram, one thing is always that you have this environment and you can keep interacting with it. And our goal is ultimately to find a policy, potentially a recurrent policy, that maximizes reward. So the question is: if you now go to a real-world system — for example, a car has a lot of different controllers in it, and even if you don't do self-driving, you rely on controllers a lot — could we just apply reinforcement learning to that? And in general, this is a really bad idea, for a bunch of reasons. The first one is data efficiency. We talked about this already, but on a real system, you cannot collect huge amounts of data, because collecting data on a real system is really, really expensive. The second reason it's a bad idea to do reinforcement learning on a real system is that a lot of it relies on random exploration — Gaussian policies, adding noise — which is usually not a good idea on a real system. And I guess Gary told you something similar from a robotics perspective. The third reason is that reinforcement learning, by design, doesn't have a notion of a test or training environment, or a validation environment. We are given an environment, and the task is to perform as well as possible on that specific environment. This is great if that is the only environment you care about. But in practice, it would be really nice if small changes, like wear and tear in the system, didn't completely invalidate your controller. And unfortunately, if you solve for the optimal policy, those solutions are typically not robust: a small change in the environment might completely derail your RL agent. The last reason this is a bad idea is that these are actual physical machines. Somebody has to sit in that car and drive it. So you don't want to just run reinforcement learning in there without being sure that you can still operate the system safely. And this is really disappointing to me. I really like reinforcement learning, and it's really annoying that we're not at the stage where we can just say: here's a physical system, let me throw an RL algorithm at it, and none of this will be a problem. Ultimately, I think, as a community, we should get there. So what do people do right now instead? If somebody came to you with this problem, what would be the state-of-the-art solution? The answer hasn't changed in quite a while. Namely, the first thing is you build a simulator for your environment, ideally a really high-fidelity one. And then you go out and you randomize a lot — just vary the different physical parameters — and try to find a policy that works well. And typically you don't want a memoryless policy at that point, because then you would have to find a single policy that works for all of the environments, which doesn't really work so well. So you use some kind of recurrent policy.
So what you're training for at that point is a policy that essentially does online system identification: it tries to figure out which system it is interacting with, and then acts optimally once it has figured that out. And then, once you go back to the real world, you just hope for the best — zero-shot transfer. Maybe you do a little bit of training on the real system as well, but then you have the problem that you lose the robustness properties that you instilled in the simulator. This was essentially already the state of the art when OpenAI did their hand manipulation of the Rubik's Cube, and there have been recent papers on this, also from DeepMind; it is still, to some extent, the state of the art. And I think we should really get away from this setting, because it's just not feasible to build simulators for every single small system that you want to control. Some of the promise of reinforcement learning is: here's a system that you don't know anything about; you don't have to invest all the time to understand the system in detail; the algorithm will find the solution, and all you have to specify is what you would like the system to do. And somehow I feel that by working with these better and better simulators, we're getting away from that. People have worked on these four problems quite a bit. In particular, the question of exploration in reinforcement learning, also from the model-based perspective, essentially tries to address the first two problems. If you do model-based RL, in principle you can find deterministic policies to roll out on the real system, even if you optimized a stochastic one inside the model. And the second topic — the question of robustness and safety during the learning process — is what this talk will be about. In particular, we will focus on the safety aspect, not so much the robustness; but in general, when people talk about safe RL, the question is: can I deploy this in the real world without having to worry? All right, and we will talk about four aspects here. The first one is: what does safety actually mean? How do you specify safety, and what are the different notions of safety? There are various versions, and typically the more general they get, the more difficult it becomes to do anything. We'll then focus on one specific definition, namely expected safety. Then we'll talk about how to actually act safely in environments given prior knowledge, and we'll see some model-based methods to act safely. And then there will be the fourth aspect: safe exploration. You already saw in the previous lecture that doing exploration naively in the model-based setting doesn't work — the moment you don't account for the fact that you have epistemic uncertainty, your algorithms might do arbitrarily poorly. And the moment you add safety constraints, at least from the theory side, there's something on top of this that makes it even more difficult. I think there are still a lot of interesting open questions there, and hopefully some of you will pick these up. So let's talk about what safety means. Let's say we have this robot here, and we have some goal state that we want to drive the system to — that's some notion of reward.
And so what I can do is take a specific policy, plug it into the robot, and have it fly a specific trajectory. And then I want to ask: was this trajectory safe? Let's say we have some safety constraint. Maybe this is some corridor you're flying through, and our safety constraint is essentially the distance from the wall: we want that distance to always be positive, so that our robot doesn't crash into the wall. People have spent a lot of time thinking about how best to specify this, but what we will use in this talk is essentially a trajectory notion of safety. Namely, there is some function G that takes as input the entire trajectory and outputs a scalar value that tells you how safe it was. For example, for us here, this might be the minimum distance to the wall seen during the entire trajectory. And like I said, people have thought about this a lot; there are lots of frameworks to reason about trajectories. In particular, there's temporal logic, which comes from formal methods, with many ways to define these kinds of functions and reason about whether something is safe or not. But for us, this will just be a continuous function that we can evaluate for a given trajectory and then say afterwards whether it was safe. For us, in the reinforcement learning setting, safety then means acting in the environment such that, for any policy we plug in or any sequence of actions we select, this safety constraint is satisfied during the entire learning process. In particular, we want to interact with the environment in such a way that the safety constraints are satisfied — for example, our quadrotor doesn't crash into a wall. To make this concrete, below is a minimal sketch of what such a trajectory-level safety function could look like.
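This is just an illustration, not anything from the slides: it assumes, purely for the example, that states are arrays whose first entry is the x-position and that the wall sits at a known coordinate.

```python
import numpy as np

def safety_g(trajectory, wall_x=1.0):
    """Trajectory-level safety measure G(tau): the minimum clearance to
    a wall at x = wall_x over the whole rollout. G(tau) > 0 means the
    robot never touched the wall during this trajectory."""
    # trajectory: states of shape (T, state_dim); we assume, purely for
    # illustration, that state[0] is the x-position.
    xs = np.asarray(trajectory)[:, 0]
    return np.min(wall_x - xs)
```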
OK, so the very first thing we're going to do is fix a policy and just look at what it means to be safe: what can we now do with this function, and how can we reason about safety for a new trajectory? And the first thing you realize is that we actually have stochastic environments. There is stochasticity in the environment itself — random noise that happens every time I act — and often we also have stochastic policies. As a result, even if I plug the same policy into the system and start in exactly the same state, what I get is a whole distribution over trajectories. So even though I've defined a safety function, I'm now in this stochastic setting, and I still need to reason about: what is the probability of being safe? How do I account for this randomness, and what does safety mean in this random setup? In particular, since tau is a random variable — every time we roll out, we get a new realization of the trajectory — the safety function G of tau, being a function of a random variable, is itself now a random variable. And there are a lot of ways you can reason about random variables and their safety properties. We now have a distribution over the possible values of this function G — some distribution of realizations — and we need to say: is this particular distribution safe or isn't it? And the very first thing one could do is reason about the expectation. This is very natural from a reinforcement learning perspective, because we really like expectations; we know how to deal with them. But one problem you have when you really care about safety is that this here is safe in expectation, right? You just have one trajectory that's very, very safe — very far away from the safety constraint — and a lot of trajectories that slightly violate it. So the expected notion of safety can be a bit too lenient in some settings: even though the behavior is safe in expectation, a lot of trajectories can actually still fail, as long as they fail only slightly and some other trajectories over-satisfy the safety constraint. So people have thought about different notions of safety under these distributions. This one is safe in expectation; this other distribution here, the blue one, has a higher mean — so it's safer in expectation, if you want — but it has a much higher variance. And so the first thing people have done is define a measure that doesn't only depend on the mean, but also on other moments, in particular the variance. One way to do this is to look at the log of the expected exponential of this random variable, which in some of the literature is called (entropic) risk. If you expand it — just Taylor-expand the definition — what you get is still the expectation of G, so it's still expected safety, but it also accounts for higher moments. In particular, you'll see that there's a term that penalizes the variance. So this is another notion of safety one could use, which accounts not only for the mean but also for higher moments. And you can go even further — this is still not the most conservative thing you can do. You can get more hardcore in terms of safety, more and more risk-averse, where people go beyond these risk definitions and start using lower confidence intervals. What I mean by that is that the probability of performing worse than some particular threshold is bounded. That's the idea: look at the value of G such that the probability of falling below it is bounded by some epsilon. That's called the value at risk; this comes from the economics side of things. And then you can go one step further, to the conditional value at risk, which is essentially the expectation of your random variable G within this lower tail. These are the typical definitions that people use for risk in the stochastic setting. So like I said, this first one is just the expected value — that's the definition. And then you can go all the way to the extreme. The most conservative you can get about any stochasticity is to say: all realizations of my random variable need to satisfy this constraint, always. What this means is that we're really looking at the support of the distribution and requiring that there is no probability mass at all that violates the safety constraint.
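To make these notions concrete, here is a minimal, sample-based sketch of the measures just mentioned, assuming we can roll out the policy and collect a batch of G(tau) values (larger G means safer; in practice one would use a numerically stable log-sum-exp for the entropic risk):

```python
import numpy as np

def risk_measures(g_samples, lam=1.0, eps=0.05):
    """Empirical versions of the safety notions from the slide, computed
    from sampled values of G(tau); larger G means safer."""
    g = np.asarray(g_samples, dtype=float)
    expected = g.mean()                                   # E[G]
    # Entropic risk: -(1/lam) log E[exp(-lam * G)]; Taylor-expanding gives
    # roughly E[G] - (lam / 2) * Var[G], i.e. a variance penalty.
    entropic = -np.log(np.mean(np.exp(-lam * g))) / lam
    # Value at risk: the eps-quantile, so P(G <= var_eps) ~ eps.
    var_eps = np.quantile(g, eps)
    # Conditional value at risk: expectation within the lower eps-tail.
    cvar_eps = g[g <= var_eps].mean()
    # Empirical stand-in for the worst case over the support.
    worst = g.min()
    return expected, entropic, var_eps, cvar_eps, worst
```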
In the worst-case setting, then, the probability that G is greater than zero is equal to one — this is the most conservative you can get about safety. Or alternatively, in the trajectory-wise notation, G of tau needs to be greater than zero for all possible trajectories that I could imagine. OK, so this is a rough overview of what people mean when they talk about safety. On the stochastic side, we have this notion of expected safety: given a random environment, I just ask whether, in expectation, I satisfy this particular constraint. We can penalize higher moments, and then we can get more and more conservative, where we start looking at confidence intervals and ultimately at expectations over low-probability events. And on the other side, there's the whole worst-case perspective, which is closer in spirit to robust control and those kinds of topics, where you're really saying there's no risk allowed — we don't allow failures at all. That's more the robust control and formal verification communities. So this is defining safety: we have some notion of which states are safe, and we define functions that evaluate whether a trajectory is safe or not. As I mentioned, there are a lot of different definitions of safety on this slide, and typically the further you go down here, or all the way to the right to the worst-case setting, the more difficult things become. Just from a purely statistical perspective: if you were given a system with a policy and you started drawing samples, getting a good estimate of the conditional value at risk is really, really difficult, because it depends on very rare events and requires you to draw a lot of samples. For the expectation, by contrast, you can draw relatively few samples from your environment and get a good idea of what your expected value is. So just from that perspective, the further down you go, the more difficult it becomes to do anything. And while there are some papers that looked at these more formal definitions, like conditional value at risk or value at risk, I would say those are right now probably the furthest away from being practical. Risk itself people have actually looked at quite a bit, but not so much, I would say, in the deep reinforcement learning setting. Expected safety is the one that by far has the best practical algorithms, mostly because it's really, really close to the normal reinforcement learning setting. And this is also what we will look at for most of the lecture: this expected notion of safety. OK, so now we know what is safe and what's not. How can we start to actually act safely in an environment? How can we act without violating constraints? If we know the environment, then this is actually a pretty well-studied problem. From the control perspective, people have been doing this for quite a while, where they assume the model is given, maybe with some confidence intervals, and you try to act within those. And there's a whole literature that's really trying to understand: given an MDP — a known MDP with known transition functions — how do I act safely in it? For us, since we're doing reinforcement learning, the key challenge is that we don't actually know this in advance. We don't know the environment.
And this puts us in a really awkward spot, right? If I just gave you a random system and said, please operate this safely, and you know nothing about the system, there's no chance that you can actually act safely. If I threw any of you a controller for a quadrotor that's hovering here and said, make sure it doesn't crash, the odds of that quadrotor not being inside that wall within two seconds are not looking great for the quadrotor. So that's one key thing about safe reinforcement learning: if you really want to make sure that you act safely all the time and never violate your safety constraint, then you need prior knowledge. And this can come from one of two sources. Either you have domain knowledge — say, you already have an approximate dynamics model with which you can control the system, at least locally — or there are methods for safe imitation learning: some human can already control the system, and you try to learn a policy that imitates this without violating safety constraints. But one way or another, if you never want to violate safety constraints, you need a good starting point. If your initial policy already doesn't satisfy the safety constraint, then the whole premise of never violating the safety constraint doesn't look so great. So that's the first caveat of safe reinforcement learning: we need some amount of prior knowledge. It's not as black-boxy as normal reinforcement learning. All right, so now we have an initial policy, and now comes the actual safe reinforcement learning part: I have this policy — how do I do better without violating constraints? Before we start talking about actual methods that qualify as reinforcement learning, let's talk a little bit about some of the things people have done in practice as a workaround. The very first thing is methods where you have extra knowledge. In particular, in the control community, you often have a controller that's good enough to keep the system safe, but not good enough to get the best possible performance — typically some linear controller based on a linear model. And so what you can do is say: I know what's safe, based on some prior knowledge, and whenever my reinforcement learning algorithm takes an action that leads to a constraint violation — for example, if my RL controller says, let's fly into this wall — then some safety mechanism takes over and drives the system away from the constraint. And then you can keep acting as if this were a normal reinforcement learning problem. The key thing here is that the learner is really seen as a kind of disturbance, or a kind of adversary: the learning agent is actively trying to destroy the system with random exploration, and you build a cage around it so that it doesn't crash. This seems really appealing, and there's a lot of work on it — it actually works in practice. A minimal sketch of this safety-filter idea is below.
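This is just an illustrative sketch of the "cage around the learner" pattern; the `is_safe` check and `backup_controller` stand in for the prior knowledge this approach requires, and the agent/environment interfaces are hypothetical:

```python
def shielded_step(env, agent, state, is_safe, backup_controller):
    """One interaction step with a safety filter: the learner proposes an
    action, and if prior knowledge flags it as unsafe, a known-safe
    backup controller overrides it."""
    proposed = agent.act(state)
    # The filter treats the learner as a potential adversary.
    action = proposed if is_safe(state, proposed) else backup_controller(state)
    next_state, reward, done, info = env.step(action)
    # Note the caveat from the lecture: the override never shows up in the
    # agent's data, so the agent may keep proposing the same unsafe action
    # at this state.
    agent.record(state, action, reward, next_state, done)
    return next_state, done
```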
It can actually work quite well if you have that prior knowledge, but it has some disadvantages. The first one is that you really need a lot of prior knowledge: you actually need to know which actions are safe and which aren't. Depending on your system, that might be reasonable, but from an RL perspective it can be quite tricky to really know this in advance throughout the entire state space. The second thing is what I just said: we need significant prior knowledge. And the third problem — and we will see that this really is a problem, especially towards the end of the talk — is that the learning agent here has no idea what's going on. From the learning agent's perspective, the overridden action probably was a great action: I put probability mass there, I want to learn something about this particular action — and suddenly I wasn't allowed to take it. So in the replay buffer, this action is not present at all. It's very possible that the very next time you come to this state, the learning agent is still very keen on taking exactly the same action, because it has not learned anything about that action. It might still look very intriguing from the reinforcement learning side; the RL agent is just not aware that it's not allowed to take it. There are some ways around this. For example, Gary talked about these differentiable trust regions, and you can do similar tricks to project actions into sets: if you have some convex region in which you say your actions are safe, you can project actions into that region and still differentiate through the projection. So that works from an RL side, but if you just apply the override naively, your reinforcement learning agent will get very confused. OK, let's talk about an actual RL algorithm that allows us to tell whether a new policy is safe. As before, we have an initial safe policy — say with initial parameters theta_b, a behavior policy. And I want to figure out whether another policy is safe. In particular, we're going to look at this performance measure: the expected return under the trajectory distribution induced by some new parameters theta, while the data we have was collected before under the behavior parameters theta_b. I want to figure out whether this new policy is safe in expectation. For our intents and purposes, this return here is going to be our safety function — it's just an expectation over tau of this function G of tau; we've only changed the objective a little bit. What I want to figure out is whether this expected value is larger than whatever safety threshold I've specified — without ever collecting data under the new policy: we've only collected data under the initial safe behavior policy. So our safety constraint is that, with high probability, the expected value under the new policy parameters is bigger than the expected value under the behavior parameters. I want to ensure policy improvement, with the main difference that for us, policy improvement really means not decreasing on the safety constraint — really making sure that the safety constraint is not violated. The key challenge here is essentially one of policy evaluation: I'm given data from one policy, and I want to figure out the performance under some hypothetical new policy. And you can actually use classical methods for this.
In particular, the question is: we have one distribution from the behavior policy; how different would the expected value of my safety measure be under a new distribution over actions — a new policy? Exactly this question: what does this data tell me about the trajectories induced by a different policy? And there's one key trick that you've probably seen over and over again at this point, which is importance sampling. It shows up all over the place in reinforcement learning. The key idea is that the expectation under the new policy is the same as the expectation under the old policy of the same quantity multiplied by the importance weight, which measures how probable the trajectory is under the new policy compared to the old one. And what's really nice about acting in an MDP is that for these distributions over trajectories, the ratio only depends on the actions: given a trajectory, the state transition probabilities under the environment are exactly the same in the numerator and the denominator, so they cancel out, and what's left is just a product of action probabilities. There are much better ways to do this — this is the very naive way to do importance sampling, and you can exploit more structure in the way we interact with the environment — but in principle, you can use this kind of method to figure out the expected value. So the last thing we need: so far we've talked about expectations; we haven't talked about guaranteeing safety. In particular, how many samples do I need in order to really ensure that my expected value is better? Given this importance-sampling identity, I know that if I could draw infinitely many samples and actually compute the expectation, I would know the expected value exactly. But how do I do this with finite samples? We get an unbiased estimate of the expectation, but can we really guarantee that, with high probability, we're going to be at least as safe as the previous policy? For this, people have looked at concentration inequalities. Essentially, you have an expectation and you can draw samples from this particular distribution, and there are lots and lots of concentration inequalities you can throw at this problem. They give you bounds that look a little bit like this: the expected value under your new policy pi — the new parameters theta — is greater than or equal to the importance-sampled estimate from a finite number of samples (big N sampled trajectories), minus a term on the right that depends both on the probability with which you want the inequality to hold and on the number of samples. I don't really want to dive into the details, but essentially this is a first method that's completely model-free: it allows you to say, given data from one policy, how well am I doing with another policy, in expectation. A rough sketch of this estimate and bound is below. So this is a very first method that you could now, in principle, use to figure out whether a policy is safe.
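A minimal sketch, with hypothetical data structures: each trajectory stores, per step, the state, action, and the behavior policy's log-probability, plus its safety value G. The Hoeffding-style bound here is only schematic — unclipped importance weights are unbounded, which is exactly why the actual papers use sharper inequalities and weight truncation:

```python
import numpy as np

def is_estimate_with_bound(trajs, new_logp, delta=0.05):
    """Importance-sampled estimate of E_{tau ~ pi_theta}[G(tau)] from
    behavior-policy trajectories, plus a crude high-probability lower
    bound (schematic only)."""
    vals = []
    for traj in trajs:
        # Transition probabilities cancel in the trajectory ratio, so the
        # importance weight is a product over action probabilities only.
        log_w = sum(new_logp(s, a) - logp_b for (s, a, logp_b) in traj.steps)
        vals.append(np.exp(log_w) * traj.g)          # w(tau) * G(tau)
    vals = np.asarray(vals)
    n = len(vals)
    estimate = vals.mean()                            # unbiased estimate
    # Hoeffding-style term: valid only if the weighted values are bounded
    # by b; we use the empirical max as a stand-in for the sketch.
    b = vals.max()
    lower = estimate - b * np.sqrt(np.log(1.0 / delta) / (2.0 * n))
    return estimate, lower
```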
This leads to a general framework for safe reinforcement learning, where you get trajectory data from your policy and then construct two different sets. The very first is the training set, from which you get some candidate policy — a policy that you hope is better and want to check for safety. And then you use the test set to figure out whether this policy is actually safe. The reason you need different sets here is that all of this relies heavily on independence assumptions: if you evaluated on your training set, you would have correlated your problem, and you could no longer use the same concentration inequalities. And once we've evaluated safety and confirmed the policy is safe, we can deploy this new policy. These are essentially two of the biggest papers in that area, which looked at policy improvement with actual guarantees — really constructing high-probability bounds that you will do better in expectation. And while they looked at rewards, you can use the same ideas for safety. So how do we actually generate a candidate policy? So far, it's been about methods that use samples to guarantee safety; we haven't talked about reinforcement learning at all. This is the point where we switch to actual ways to generate policies. The very first method we're going to look at is called constrained policy optimization (CPO). The idea is essentially what we talked about before: we have some MDP, and we want to act in that MDP subject to a constraint where, in expectation, some other sum of returns is upper-bounded. The notation is a bit odd here, because it's now an upper bound where before we were talking about lower bounds; but in this formulation it's really convenient to flip the sign, because then we can treat the constraint as just another value function. On a high level: we want to maximize performance, so expected return, while making sure that some other expected return is below a particular threshold. What's really important here: this is expected safety. On average, we want to make sure that the stochastic rollouts don't accumulate a cost — in this case it's a notion of cost — higher than this threshold, and without loss of generality that threshold is just zero. This is just a run-of-the-mill constrained optimization problem, and there are only so many ideas for solving constrained optimization problems. Almost all of them go via some Lagrangian dual and try to solve that through various approximations. The core ideas are actually very similar to classical methods, like trust-region methods, where you're also trying to ensure that you don't violate some constraint — there, usually a trust region on the policy; here, we additionally have a constraint on these cost value functions. But fundamentally, this is an algorithm you can implement, and it will learn a value function and try to satisfy this constraint during the learning process. It doesn't guarantee safety, though. If you really wanted to make sure that policies are safe, you would still need the entire framework of drawing new samples and then validating the policy, because solving this relies on a bunch of approximations. A bare-bones sketch of the Lagrangian idea is below.
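This is not CPO's actual trust-region machinery — just a minimal primal-dual sketch of the Lagrangian relaxation that this family of methods builds on; `policy_grad`, `cost_grad`, and `jc_estimate` are assumed to come from rollouts or a critic:

```python
def primal_dual_step(theta, lam, policy_grad, cost_grad, jc_estimate,
                     lr_theta=1e-3, lr_lam=1e-2):
    """One step on the Lagrangian L(theta, lam) = J(theta) - lam * Jc(theta)
    for the constraint Jc(theta) <= 0."""
    # Primal: gradient ascent on the Lagrangian in the policy parameters.
    theta = theta + lr_theta * (policy_grad(theta) - lam * cost_grad(theta))
    # Dual: increase the multiplier while the constraint is violated,
    # keeping it non-negative.
    lam = max(0.0, lam + lr_lam * jc_estimate(theta))
    return theta, lam
```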
But at least in principle, this is a very first algorithm that tries to find a policy that eventually is safe — we satisfy the safety constraint eventually, through the training process. And the reason you can't guarantee safety is that the assumption here is essentially that you know the critic function, which in practice you don't, and then things go astray. So this actually concludes part one. We've talked about completely model-free methods: just drawing data and trying to improve performance. We started out by reviewing safety definitions — the stochastic variants with expected safety, building confidence intervals, and the worst-case definitions of safety. Then we saw how to obtain a first safe policy, either through prior knowledge or by doing imitation learning. And finally, we saw a very first method to do safe learning in expectation. The second half will be about how to do safe exploration in practice, and we'll take a very model-based perspective on this. In particular, what we're doing now is model-based safe reinforcement learning. The reason this is interesting is that a lot of the problems with the model-free methods came from the fact that, in order to verify safety, I had to collect new data from the environment. So these methods are even less data-efficient, because they need to collect data just to figure out whether things are safe or not. In the model-based setting, at least in principle, you could use your model to generate that data, or even use the epistemic uncertainty estimates in your model to circumvent this entire problem. And you've seen a very similar slide before — this is model-based reinforcement learning. Yes, question. So — this really depends on your problem. The question was: does it make sense to be safe in expectation? And it really depends on your problem. If your problem is extremely stochastic, maybe not. If it's close to deterministic, like a MuJoCo environment or so, then it's probably fine. But it really depends. And mind you, this is an expectation over the noise in your environment — it's not an expectation over epistemic uncertainty. That's really important. In principle, it would be really nice to be extremely safe, but as I tried to say in the beginning: if you're trying to be very safe with respect to rare events, then your requirements for collecting data go through the roof, because to see these rare events you need to collect a lot more data — by definition, they're rare. So the short answer is: it depends on your problem, but the more safe you want to be, the less data-efficient it becomes. If you want to be really conservative, you will need more data, which is why people focus on expected safety — also because it's easier. But then, as you said, simulations are getting cheaper; so why couldn't we use simulation to solve this specific problem — just collect way more data, because it's super cheap? Because simulations, as good as they are, are only an approximation of the real world. If you have a perfect simulator — if you essentially say this is the real environment — then there's no reason not to. But the fact is that for most problems, we do not have perfect simulators.
And even though people are building simulators for specific problems, I guarantee you there are a lot more interesting problems that do not have good simulators. It's not just robotics. Yes? No — I mean, you can always transform an expectation guarantee into a high-probability one. If a 90% probability of not hitting the wall is enough for you, then for a finite horizon, or a discounted horizon, that's equivalent to just solving the expectation problem. So for some things it's not almost-sure safety, but it's maybe close enough. So, back to model-based reinforcement learning. As you've seen before, we're collecting data, and we will now do model learning, say with some neural network. Or — you have a follow-up? If you have an expectation, why don't you just put this in the reward — penalize unsafe events in the reward? Yes — Rich Sutton would have asked exactly the same question: don't mess with my framework; why do I need constraints? Just put it in the reward. And yes, you can do that. The problem is that if you really want to guarantee safety, you essentially need a minus-infinity reward when you violate the safety constraints, and that actually makes things a lot more difficult — these very discontinuous rewards are much more challenging. And if you don't do that, then you need to trade off how much you weight the safety term versus your reward, which is also non-trivial. So yes, in principle you can do that, and in a simulation that's fine. But if you want to go to a real-world setting — also empirically — adding constraints and treating them separately just works better. So: we do model learning, and then we do safe policy optimization. It's exactly the same setting as before; model learning doesn't change at all. What changes is how we do policy optimization, because there we now additionally have to think about these safety constraints. And we've seen something similar to this before. We're going to assume that we have some dynamics model — here it's deterministic, but in principle it can be stochastic too — and we have some notion of model uncertainty. Here, as an example, this is a Gaussian process; in what follows we will mostly be using ensembles, as we did in the previous talk. The idea is that as you collect data, you become more and more certain about your particular dynamics function. The epistemic uncertainty — this blue uncertainty here — shrinks as you collect more data, and you become more certain about which particular environment you're interacting with. That's the key idea behind these model-based methods. And what this means is that, ultimately, you can think of this as having some set of models — some class M_t that contains all the dynamics models that are compatible with your epistemic uncertainty. In a Bayesian setting, this would just be a distribution over models — a posterior distribution — and then you look at high-probability regions of this set. A minimal sketch of the ensemble version is below.
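Just a sketch of the ensemble idea under assumed interfaces (each member is some regressor with fit/predict); disagreement between members stands in for the epistemic uncertainty:

```python
import numpy as np

class EnsembleDynamics:
    """Epistemic uncertainty via an ensemble of learned dynamics models."""

    def __init__(self, members):
        self.members = members                 # K independent regressors

    def fit(self, states, actions, next_states):
        x = np.hstack([states, actions])
        for m in self.members:                 # optionally bootstrapped data
            m.fit(x, next_states)

    def predict_all(self, state, action):
        """Next-state prediction from every member: shape (K, state_dim)."""
        x = np.hstack([state, action])[None]
        return np.stack([m.predict(x)[0] for m in self.members])

    def epistemic_std(self, state, action):
        """Member disagreement ~ epistemic uncertainty; this shrinks as
        more data is collected, narrowing the model set M_t."""
        return self.predict_all(state, action).std(axis=0)
```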
All right — so this was the setting we were in before, with CPO: the goal was to find a policy that maximizes performance subject to an expected cost constraint under the true environment. And now we're going to switch to the model-based setting, where we can do two things. The first one is optimistic exploration: we are also maximizing over the function class. And this is why it's convenient to write the constraint as an upper bound: here we are also trying to make sure that, for all models in this function class, the expectation is below a particular threshold. So we've done nothing other than replace the model-free setting with a model-based one, where we explicitly account for the fact that we have access to our set of models and that it captures the epistemic uncertainty. And if, for every model within this epistemic uncertainty, we can guarantee that safety holds, then we're fine: we know that applying this on the real system will be safe, because the assumption is that the true model is in this model class with a certain probability. What's really important here — and I want to stress this — is that the expected safety here is over the aleatoric noise: the noise in the environment, randomness that is non-repeatable. The worst case, this maximum here, is over the epistemic uncertainty. We're saying: for every model within our model class — every model that could explain the data — we want the safety constraint to be satisfied. So let's look at this iteratively. Here's the setting: we have states against time, and if we start acting from a particular state, then for a given model we have a distribution over possible states. Yes? I had a question about the previous slide: why don't we learn with the worst possible model? I mean, that is what this is saying, right? There's my model class, and I'm saying that the worst-case model should still satisfy the safety constraint. OK, but what I was thinking of would be a mean over models. Yes — so this is not an expectation; this is a max, really a worst case over the models. And essentially, you control your model class through how confident you want to be that your model is in there. And then, for all of those models, you want to make sure that, in expectation over the aleatoric uncertainty, the resulting behavior is safe. So these should be different models, right? So — I mean, we could completely cross this part out and have any kind of RL objective; really, what we care about is this constraint here. But it makes sense to be optimistic. OK, let's not dive in there. What happens in practice is that this model class is actually finite, at least in most practical experiments, and then this works; otherwise you can get into other technical problems that I don't really want to dive into right now. OK, so pictorially, this is what happens. We have a model, and it encodes some distribution over states over time. Again, this is for a given model p from our model class — the small p — and we can look at the expected return here; for now, expected return and expected safety are going to be the same thing for us. That means we have a corresponding critic function for this particular red model here. But we actually have a distribution over models, or a finite class of models, depending on how you parameterize your epistemic uncertainty. And that means we have a lot of different distributions over states.
And that also means we have a lot of different critic functions. For each model we have, in principle, some notion of the expected return — here under this red distribution, or under the blue one, or the purple one. And now, if this is actually a finite class — as I said, most people like to use ensembles, or you have a distribution you can sample from — we can take all of these critic functions, one per model, and take the worst one, i.e. the maximum over all the cost critics. This was the idea in this paper on constrained policy optimization via Bayesian world models, and they actually do this in the POMDP setting. Then, essentially, you can use constrained policy optimization as before, just this time with the worst-case critic under your models. And this actually works really well. Here is an example of the kind of POMDP environments they tried this on — roaming around while not touching these blue boxes. And you can see here, I think in the leftmost environment, that especially compared to other methods, even at convergence, this works really, really well. The blue one here is their algorithm, called LAMBDA, which essentially learns these different critics. And then there's an unsafe version, which ignores the safety constraints. The unsafe version gets higher performance, but it violates all the safety constraints that they've defined. If you add the safety constraints, you get similarly good performance — you obviously lose a little bit by adding a constraint — but you also actually satisfy the constraints: you get really low average costs. So this is really cool: it's a first method that maps model-based uncertainty to value uncertainty and then uses classical algorithms by reasoning in value space. Below is a minimal sketch of that worst-case critic idea.
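A sketch only — per-model critics are assumed to be given as callables, and "worst case" for an upper-bounded cost means taking the maximum:

```python
def pessimistic_cost_value(cost_critics, state, action):
    """Worst-case safety critic: evaluate the cost critic of every model
    in the ensemble and take the maximum (costs are constrained from
    above, so the max is the pessimistic choice)."""
    return max(critic(state, action) for critic in cost_critics)

def optimistic_reward_value(reward_critics, state, action):
    """Optimism in the face of uncertainty for the reward objective:
    best case over the per-model reward critics."""
    return max(critic(state, action) for critic in reward_critics)

# These plug into constrained policy optimization: maximize the optimistic
# reward value subject to pessimistic_cost_value <= threshold.
```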
One thing one could ask, though: training this per-model critic is actually really annoying, right? Suddenly, for each possible model, you have to train a critic function, and it gets a little bit messy. So one could ask: if we have a distribution over models, then there's a corresponding distribution over value functions, and these critics we just saw are really just samples from that distribution. The question that one of my PhD students, Carlos, worked on is: can we actually learn this full distribution? If somebody gives me a distribution over models, can I learn the corresponding distribution over value functions? In particular — and this is the key difference to what people did before — value functions are expectations over the aleatoric noise, so the only uncertainty we're trying to capture here is the epistemic uncertainty. And for this, we're going to use ideas from distributional reinforcement learning. I'm not sure whether anybody talked about distributional RL before in the summer school. OK, so either all of you were asleep or the answer is no. The idea is actually super simple. You have this distribution, and you somehow want to approximate it with some function approximator. The easiest — and empirically successful — way to do this is to approximate it with quantiles. In this case, you take, say, seven different quantiles, and for each of these quantiles you want to make sure that the probability of your function value lying to the left of it — below it — is equal to the corresponding quantile level. What's the idea? You take the CDF and you discretize it at certain probability levels, so that each individual interval accounts for a certain amount of probability mass. OK, and then you construct a loss function that, for each quantile, has its minimum at the true quantile location — these are convex functions. The derivations usually look quite complicated, but the idea is really simple, because what you end up doing after all the derivation is updating each quantile based on samples of values. In particular, what you want from this definition is that, for each threshold tau_i, the probability mass below it equals that particular level. So the idea is to look at these intervals: for the very first interval, you want, say, 10% of the samples to fall below it, and you construct the update so that, in expectation, only 10% of the samples map to the left of it. It's a little bit complicated to explain in words, but if you think about this equation, what it's doing is that, for this particular tau_1, it tries to map most of the samples to the right — this is the indicator function here — and only 10% to the left, because in expectation those terms cancel out. So it's a very simple algorithm that just uses samples in order to approximate a particular distribution. A small sketch of this quantile loss is below.
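Here is a minimal sketch of that quantile-regression (pinball) loss — for each level tau_i, the minimizer over target samples is exactly the tau_i-quantile of the target distribution:

```python
import numpy as np

def pinball_loss(predictions, target_samples, taus):
    """Quantile-regression loss. predictions: current quantile estimates,
    shape (K,); taus: quantile levels, shape (K,); target_samples: sampled
    values of the distribution to approximate, shape (N,)."""
    # u[i, j] = target_j - prediction_i
    u = target_samples[None, :] - predictions[:, None]
    # Asymmetric weighting: errors below the estimate get weight (1 - tau),
    # errors above get weight tau, so the minimum sits at the tau-quantile.
    loss = np.where(u >= 0, taus[:, None] * u, (taus[:, None] - 1.0) * u)
    return loss.mean()

# E.g. seven quantile levels, as in the slide's example:
taus = (np.arange(7) + 0.5) / 7.0
```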
Yes? Yes, so these are Dirac deltas — I'm approximating the distribution essentially by samples, by Dirac deltas that have a certain probability mass below them. Are these seven points sampled, or are they fixed? No — the gray one is the true distribution, and I can draw samples from it; but these are fixed values chosen so that this equation holds: the probability of a value falling below each of them should equal the corresponding probability mass. And how do we approximate the distribution in between? These are quantiles, so it's a discrete distribution in the end, and how you interpolate in between is up to you — typically it's just a step function. So when you say distributional RL — my understanding is that you consider distributional value functions? No: normal distributional RL takes an environment — it's not in the model-based setting — and tries to estimate the stochasticity in the return coming just from aleatoric uncertainty. What we're doing here is: we're in the model-based setting, and our samples are actually value functions. They're already averaged over aleatoric uncertainty, and the only uncertainty that's left comes from the epistemic uncertainty in our models. So it's using the same ideas, but for epistemic uncertainty only, rather than aleatoric uncertainty. And if you were to just do a lot of rollouts from my model and do distributional RL, you would mix aleatoric and epistemic uncertainty — and we'll see in a second that this is a bad idea. OK, so this relies on samples of value functions, and this is essentially the last insight: how can we generate a sample? Well, by first sampling one of these quantile values at random, tau_j, and then doing bootstrapping: for each transition we have, we sample a random target value and evaluate it under a particular model. So models and target values are sampled at random. This actually recovers the true distribution under the assumption that the models and the values are independent — which in most cases they're not; for each model you usually have one value function — but in the discrete setting where you don't revisit states, this is actually true. These are similar assumptions to what people typically make for these uncertainty Bellman equations. So now that we can approximate one distribution, the last step we need is a neural network that outputs the parameters of the distribution: given a state, our network outputs the values at the different quantile levels tau_i of the distribution. Yes? When we are selecting a policy, we're still taking the max over all actions, right? So how does the distributional part of your value function come into the picture? So — I haven't talked at all yet about what to do with this distribution. In principle, now that you have the full distribution, you can follow exactly the same ideas as before: where we previously took the max, you can now, given the distribution, use any measure you want. You can take the minimum of the support of the distribution — the worst case — or you can use risk. By actually having the distribution, you can be a lot more flexible in the safety definitions you use. I see — so you won't necessarily take the max if you don't want to. Yes. Actually, we haven't looked at this in terms of safety yet — this is relatively new — but we'll see for exploration that how you use that distribution can actually matter a bit. That's essentially the next slide. So this is now looking — like I said, not yet in the safety-critical setting — at the performance of SAC on the quadruped-run and walker-run DM Control tasks. You can run MBPO, which is one of these typical algorithms that just uses the model, generates rollouts, and then applies SAC; this is averaging over epistemic uncertainty. Then you can use an ensemble of Q-functions, which will already do slightly better, because there is some notion of epistemic uncertainty in your Q-functions.
What you can then do is run MBPO but with quantile regression — exactly what I described before, but it's essentially distributional RL mixing aleatoric and epistemic uncertainty, so both kinds of uncertainty, approximated with quantiles. That's this green line here, and it will actually already do better: somehow these quantile methods are much better at estimating the mean than plain mean-squared error. And then you can go one step further — the blue one and the orange one. These are the methods I just talked about, where the idea is to really approximate the distribution over value functions: you average out aleatoric uncertainty and use distributional RL as discussed before. This does even better. And then you're free to choose your exploration criterion. For these particular environments, since they're not sparse, at least in our experiments the mean actually does reasonably well; in some environments this here, optimism — looking at an upper confidence bound on the value functions — does a little bit better. But at least for these environments, the choice didn't seem to matter that much; the mean estimate seemed to be the most important bit. So this was model-based uncertainty, which we then mapped to value-based uncertainty, either by learning a different critic for each model or by trying to learn the full distribution over values. The natural next question is: can we actually do this directly in model space? And this goes back to the previous lecture — yes, you can. I'll cover this only very briefly. One thing you can do: if I predict ahead under my models, I get potentially very complicated distributions — they can be multimodal. One option is to over-approximate them with something tractable, like ellipsoids, and then use planners that can deal with those — in particular, constrained optimization with these ellipsoidal constraints; there are methods to deal with them effectively. So essentially here, the idea is that you reason, again in model space, over multiple steps, and you make sure these ellipsoids don't collide with certain constraints. That's one way, and you can use any planning method, for example MPC. The other way: we talked about optimism — H-UCRL as a means to optimize over functions. You can do exactly the same thing in the safety-constrained setting, where again, every time you predict forward, there's epistemic uncertainty, and you have another policy that is allowed to move freely within this epistemic uncertainty — again, over multiple steps. You now get a max–min problem, where the safety constraint is evaluated under the worst case over the epistemic uncertainty, but it's exactly the same idea, and there have been papers that tried to do this. Essentially, it builds on this idea of optimizing over functions by reparameterizing the epistemic uncertainty — a small sketch of that reparameterization is below. So what we just talked about in value space, you can also do with the same methods we had from normal model-based reinforcement learning; it's just that you then have these more difficult optimization problems.
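A sketch of the reparameterization trick behind H-UCRL-style methods, under the assumption of a diagonal confidence band mu ± beta·sigma around the model mean; the auxiliary ("hallucinated") control eta picks a model inside that band:

```python
import numpy as np

def hallucinated_step(mean_fn, std_fn, state, action, eta, beta=1.0):
    """One imagined transition where eta in [-1, 1]^d selects any dynamics
    model inside the epistemic confidence band. Maximizing return over eta
    gives optimism; minimizing the safety value over eta gives the worst
    case, yielding the max-min problem mentioned above."""
    eta = np.clip(eta, -1.0, 1.0)
    return mean_fn(state, action) + beta * std_fn(state, action) * eta
```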
Yes? So this is what I tried to talk about around slide 10 or so: when you have prior knowledge, you can use that prior knowledge to say which things are safe and which are not. For example, if you have some convex set of actions that are known to be safe, there was this paper — which Gary also talked about — from Zico Kolter's group, showing how to backpropagate through this additional constraint: you can project actions into the set and still differentiate through the projection. But this is not the same, right? Because an expert policy is more like a policy that is not going to visit unsafe trajectories, whereas this is more like saying: I don't know the final distribution over actions. But couldn't you, for example, have a parametrization of a controller that is safe by definition — I don't know, PID stuff? Well, if you have a controller class that is safe by definition, then you just do normal RL, right? Such classes might exist, but I think for practical problems it's not always so easy to find a policy class that is safe for any parameter values. OK, thank you. OK, so in principle, that means you can plan over your state space the same way. And then, if you have, for example, some initial policy that is safe locally — something that takes care of the infinite-horizon nature of planning, for example a value function that tells you: here in this region, I know how to stay safe in the long term — then you can start planning trajectories that always bring you back to the safe region: a safety backup trajectory. And then you can do exploration however you want, as long as you ensure that you always have this backup. In particular, if at least the first step of the two trajectories is the same — so that, after following the very first step, you know for every possible realization how to get back to some safe region of the state space — then these methods will be safe in an infinite-horizon sense. And essentially, this uses exactly the same ideas as before; it's just that the optimization problems are trickier due to the safety constraints. All right. So now we've talked a lot about how to actually get policies. We've seen, from the value side, how to map epistemic uncertainty in models to epistemic uncertainty in value functions and use methods like CPO, and we've seen, very briefly, how to do this in the primal mode of directly reasoning about state distributions. So the last thing I want to do, in the last half hour, is talk about safe exploration. Because one thing we've ignored so far is: does this actually converge? Do we actually find the best safe policy? So let's briefly define what we want to do. At each time step, we want to select a safe policy, and then we evaluate the corresponding trajectory; it's an iterative procedure where we keep updating the policy. On the one hand, we want this policy to converge to some notion of optimality — subject to constraints, obviously, but we want it to achieve high performance. And at the same time, we want it to be safe. And that means an algorithm should proceed like this.
So let's say we have this quadrotor flying to this target, and we have some constraint, this red square here, that we don't want to enter. What we want our algorithm to do is start with an initial policy that explores, maybe randomly, without entering the square. Over time it can get closer to the target, still without entering the square. And towards the end we want it to find the actually optimal policy, one that is both safe and solves the underlying RL problem. What we've assumed so far is that we have a well-calibrated model, so a model class whose epistemic uncertainty captures the true environment. And then we've done something like this: we define some RL objective over policies — for example, in the LAMBDA paper I showed, the one with the multiple critics, the exploration objective was this optimistic objective — some principled RL objective that, at least in the model-based setting, will still find the optimal policy. And then we add some worst-case constraint such that the probability of violating the safety constraint is bounded. So the question is: is this actually enough? Will this actually work, will we actually find a good policy? To answer that, we'll take a little detour into the Bayesian optimization setting. I'm not going to bore you again with the whole Bayesian optimization setup, the bandit setting we also saw in the model-based lecture. Essentially we now have an optimization problem, pick parameters to maximize return, and we also have a safety constraint. GPs, you've seen those. So here's the idea. Unlike in the previous bandit example, we now have two functions: a performance function J and a safety function G, and each parameter gives a particular value for both. I also have a safety threshold, so I'm only allowed to evaluate parameters whose safety value is above this dashed line; that's the notion of safety here. And this is then the safe optimum, where I just barely satisfy the constraint while maximizing the performance function. Now I have epistemic uncertainty, in this case a Gaussian process model, and initially I know essentially nothing. As I said, safe RL always needs a safe starting point, so somebody gives me an initial feasible sample, and based on that I can already declare these parameters here safe. What I could do now is take an algorithm known to work in the bandit setting, for example UCB, subject to the constraint that, with high probability under the epistemic uncertainty, I always stay safe. If we run this algorithm, it keeps exploring within the safe set, always maximizing the upper confidence bound. And what you can already see is that it gets stuck in a local optimum: the global safe optimum is over here, but the algorithm gets stuck in this very local one. If you remember the model-based RL setting: there, in the bandit setting, greedy exploration, just optimizing the mean over epistemic uncertainty, got stuck in a local optimum.
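As a concrete picture, here's roughly what that safe-UCB baseline looks like. A hedged sketch, assuming a GPy-style predict interface over a finite candidate grid; J_gp, G_gp, and beta are illustrative names, not from any specific paper.

```python
import numpy as np

def safe_ucb_step(J_gp, G_gp, candidates, threshold, beta=2.0):
    # Pessimistic estimate of the safety function: lower confidence bound.
    mu_g, var_g = G_gp.predict(candidates)
    lcb_g = (mu_g - beta * np.sqrt(var_g)).ravel()
    safe = lcb_g >= threshold            # provably-safe candidates (needs the safe seed)
    # Optimistic estimate of performance, but only inside the current safe set.
    mu_j, var_j = J_gp.predict(candidates)
    ucb_j = (mu_j + beta * np.sqrt(var_j)).ravel()
    ucb_j[~safe] = -np.inf               # never evaluate outside the safe set
    # This is exactly the rule that can get stuck in a local safe optimum.
    return candidates[np.argmax(ucb_j)]
```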
Here, just by adding a safety constraint that we are learning at the same time as we search for the optimum, we get exactly the same problem: taking an algorithm that works in the bandit setting, like UCB, and adding a constraint that you learn about on the fly doesn't work. Somehow, adding constraints that you have to learn about makes the exploration problem even harder. And again, this is the bandit setting, a special case of reinforcement learning; so if you do this in model-based RL, at least in theory there are instances where you will not converge to the optimal solution, and I'll show you an example of what that looks like later. So this is just what I said: GP-UCB only explores within whatever is currently known to be safe, but it doesn't actively learn about the safe set, and that's the core of the problem here. Okay, so: combining existing exploration algorithms with safety constraints does not retain the exploration guarantees of those algorithms. So, yay, we get to do more theory. Here's one algorithm that actually works. It's just active learning in the bandit setting: rather than trying to find the maximum of the function, I completely forget about performance and just try to identify the safety constraint everywhere. If I run this algorithm, it does not get stuck in the local optimum, because it ignores performance entirely; it explores globally and learns about the entire safe set. Obviously not a great algorithm, because it really does learn about the safety constraint everywhere, but it is essentially one of the key algorithms known to work in this bandit setting: pure exploration. What people have then done is take this idea, that purely exploring the safety constraint works, and make it a bit less exploratory. So people have tried to define a set of parameters close to the boundary of your current safe set, and a set of parameters where the optimum could still lie, some notion of UCB, and then trade off between these two sets by selecting the point with the maximum variance. And here's the idea. Initially this starts out much like the previous algorithm, but now you can see that the parameters that could potentially be optimal, this green set here, are still in the local region, yet there is still a bonus for exploring, because on the boundary of the safe set the uncertainty is also large. So this algorithm keeps exploring, because learning about the safe set is rewarded, and it eventually finds the global safe optimum over here. So it's an algorithm that actively tries to learn about the safety constraint, combined with a measure of uncertainty about the optimum; a take on UCB, if you want, where you do uncertainty sampling within the sets that look promising from a UCB perspective. This actually comes with guarantees; I don't want to dive into the details, but here are the references if you want to look into it. The algorithm is called SafeOpt: we never make unsafe decisions, and there are also finite-sample guarantees on how long it takes to explore the safe set and find the optimum.
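Here's the selection rule as a rough, hedged sketch, with safety and performance modeled by the same GP (as in the quadrotor example coming up next). The expander computation in particular is heavily simplified compared to the actual SafeOpt paper: it just treats safe points near the constraint boundary with high uncertainty as candidates for growing the safe set.

```python
import numpy as np

def safeopt_step(gp, candidates, threshold, beta=2.0):
    mu, var = gp.predict(candidates)
    mu, std = mu.ravel(), np.sqrt(var).ravel()
    lcb, ucb = mu - beta * std, mu + beta * std
    safe = lcb >= threshold                        # needs the initial safe seed
    # Maximizers: safe points whose optimistic value beats the best pessimistic one.
    maximizers = safe & (ucb >= lcb[safe].max())
    # Expanders (simplified): safe points near the boundary whose evaluation
    # could certify new parameters as safe.
    expanders = safe & (lcb - threshold <= beta * std)
    pool = np.where(maximizers | expanders)[0]
    # Trade off the two sets by picking the most uncertain point among them.
    return candidates[pool[np.argmax(std[pool])]]
```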
All right, and here's a video, by now relatively old, of running this algorithm in an RL context. Here we've defined safety through performance: what you see is the mean performance as we tune two parameters on a quadrotor. The algorithm automatically selects parameters and tries to identify this performance function, which for us right now doubles as the safety function. And you can see it only selects parameters that actually satisfy the safety constraint, even though there are parameters in the set that would perform significantly worse — I'll take the question right after the video, if you don't mind. So essentially you can see that this is already a method that actively trades off exploration, learning about the safety constraint, against performance. And yes, now the question, please. We saw in this example that the safe set is connected, but what if the safe set is separated into several components? I know there are solutions for safe Bayesian optimization, but are there also solutions for reinforcement learning, or is this a problem there as well? We will get to the reinforcement learning part. The analysis for this algorithm essentially looks at finite sets, not so much the continuous space, at least the versions where you get sample-complexity results for this trade-off. And there, yes, you can have disjoint safe sets. Okay, there's another video, but we can skip that. So this kind of uncertainty sampling was, for a long time, essentially the only game in town. I briefly want to mention work by another PhD student of mine, who has been looking at using information-based methods instead, trying to get away from these uncertainty-sampling-style approaches. The idea is the following: we want to define some metric for how much I can learn about the safety of a given parameter by evaluating a parameter that I already know to be safe. So we define the mutual information between an evaluation at a parameter x that is currently known to be safe and, for a parameter z, the indicator of whether its safety value is above or below the threshold. It turns out, and this is non-trivial, that you can actually get a closed-form approximation of this, and for that approximation you can also get some basic exploration guarantees. This was the first paper; there is a follow-up with more guarantees. Here's the intuition. You have your current safe set, you look at a parameter x within it, and you pick a comparator point z outside. What you can see in this information curve is that evaluating close to the boundary of the safe set gives you a lot of information; parameters further away still let you learn something, but they provide less information. And if you pick a point where the epistemic uncertainty is already very low, you get almost no information at all. The idea is then to jointly optimize over both a safe parameter and an unsafe parameter.
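As a rough illustration of that criterion, here's a hedged Monte Carlo sketch. The actual paper derives a closed-form approximation; this sketch brute-forces the same quantity, the mutual information between a noisy observation at a safe x and the indicator of whether g(z) clears the threshold. The gp.condition_on method is a hypothetical interface for forming the posterior after one extra observation, not a real library call.

```python
import numpy as np
from scipy.stats import norm

def bernoulli_entropy(p):
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def safety_information(gp, x, z, threshold, noise_var, n_samples=200):
    # Current uncertainty about the safety indicator I{g(z) >= threshold}.
    mu_z, var_z = gp.predict(z)
    h_before = bernoulli_entropy(
        norm.sf(threshold, loc=float(mu_z), scale=float(np.sqrt(var_z))))
    # Expected entropy of that indicator after a simulated noisy observation at x.
    mu_x, var_x = gp.predict(x)
    samples = np.random.normal(float(mu_x), float(np.sqrt(var_x + noise_var)), n_samples)
    h_after = 0.0
    for y in samples:
        gp_y = gp.condition_on(x, y)  # hypothetical: posterior given one more (x, y)
        m, v = gp_y.predict(z)
        h_after += bernoulli_entropy(
            norm.sf(threshold, loc=float(m), scale=float(np.sqrt(v))))
    return h_before - h_after / n_samples   # Monte Carlo mutual-information estimate
```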
So the idea here is: you figure out which parameter z you want to learn about, and then you use this information criterion to decide where to evaluate in order to learn about it. We're still developing this, but I think it may also turn out to be useful later in the RL setting; we'll see that at the end. And yes, it does perform better than previous work, at least in the noisy setting. Okay, back to model-based reinforcement learning. That was a longer detour into the bandit setting, but the key point is this: model-based RL lives on this axis, how exploration interacts with model learning, and we've seen that that's non-trivial; there is epistemic uncertainty in the model, and we need to account for it when exploring. What we've now seen in the bandit setting is that safety adds another interaction, because safety is defined through the model: where I have high epistemic uncertainty, I cannot explore while still guaranteeing safety. And that in turn affects exploration, because if I just act safely and apply existing algorithms, I will no longer explore successfully. So what does this mean for model-based RL; what can we actually do? I want to point out that this is from the theoretical side, it's about giving guarantees. The algorithms we saw before did optimistic exploration, which theoretically is not guaranteed to converge here but still works in practice — though only in settings where the reward itself drives exploration sufficiently, so that guaranteeing you don't violate the safety constraints under epistemic uncertainty doesn't completely destroy exploration. Here's an illustration of this. Say we again have some target, some starting state, and some obstacle. Here's essentially the problem in model-based RL that we just saw in the bandit setting. We can plan a safe, robust policy that is conservative with respect to the worst case over models while still maximizing some performance measure of the policy π. But the actual optimal policy, without this conservatism, might have a completely different state distribution. And now the problem is that by selecting these pessimistic policies, I may never collect data up here, above the obstacle, the data that would let me figure out that over there, there's also a safe and viable path. At least in this example it's easy to see: if I only ever collect data along the bottom, then even though the top path might have lower cost, I will never learn that it's safe. The typical benchmarks we design tend to avoid these kinds of problems, but in scenarios like this an RL algorithm can have exactly this failure mode, where going for reward subject to the safety constraint never collects the data that would reveal the other path that is also safe but has better return. So at least from the theory side, these problems really can occur if you combine existing algorithms with safety constraints. That's what I said here: the robust policy may never collect the informative data that would let us learn that the top path is safe.
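To make the failure mode concrete, here's a hedged sketch of the pessimistic policy selection just described, over a finite set of candidate policies and an ensemble of plausible models; J, C, and delta are illustrative stand-ins for return, constraint cost, and the allowed violation level. Note that nothing in this objective rewards collecting data about the uncertain top path, which is exactly the problem.

```python
# Pessimistic constrained policy selection over a model ensemble (sketch).
def robust_safe_policy(policies, models, J, C, delta):
    # Keep only policies whose constraint cost is acceptable under *every*
    # plausible model (worst case over the epistemic uncertainty).
    feasible = [pi for pi in policies
                if max(C(pi, m) for m in models) <= delta]
    # Among those, pick the policy with the best worst-case return.
    return max(feasible, key=lambda pi: min(J(pi, m) for m in models))
```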
And so what are the solutions we saw? One of them, in the bandit setting, was active learning: trying to identify the safety constraint everywhere. And for these kinds of methods we actually have theoretical guarantees. But first, a word on the linear setting, which is actually the only setting that is really well understood in safe reinforcement learning. There was a really nice series of papers that looked at uncertainty in the system matrices of linear models; from those you can construct a convex optimization problem, and you are guaranteed to eventually identify the optimal safe policy (I believe this is published by now). And the key point is that in linear models you don't need to actively explore: the entire analysis relies on the fact that random exploration is good enough in these linear models to reach every relevant mode. All right, so back to uncertainty sampling. As I said, this is one of the methods known to work in bandits, and those guarantees actually translate to the reinforcement learning setting. So here's the idea. Say we want to certify that a certain region of the state space is safe under a given policy, in the sense that if I act with my policy inside this region, I never actually leave it. What would be nice is that, as I get more certain about my model, this region expands over time, so I can show that a larger region of the state space is actually safe to visit under my policy. And here's a basic idea of how to analyze these kinds of algorithms and actually give safety and exploration guarantees. In particular, there's a concept from control theory called Lyapunov functions. These are positive-definite functions defined on the state space; the little blue set here at the center is the region we initially know to be safe. The idea is to look at level sets of such a function: if you can guarantee, within a level set, that the dynamics keep pushing you downhill on this function, towards the center region, then safety follows. More precisely, there's a very classic result from nonlinear control theory: if for each state in the level set the next state has a smaller function value than the current state, then you can actually guarantee safety. Now, the key difference for us is that part of the model is unknown, and that means for each state we are actually uncertain about where we will end up in the state space. So instead, we try to give high-probability guarantees that, under my model distribution, every plausible model actually goes downhill. It's a way to map epistemic uncertainty to some notion of always converging back to the safe set.
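Here's a minimal, hedged sketch of that certification step, assuming a GP-style dynamics model and a known Lipschitz constant for the Lyapunov function; dynamics_gp, v, and beta are illustrative names, and the two-argument predict interface is an assumption. A state is certified only if the decrease condition holds for every model inside the confidence region.

```python
import numpy as np

def certify_state(dynamics_gp, policy, v, state, beta=2.0, lipschitz_v=1.0):
    action = policy(state)
    # Predicted next state, with epistemic uncertainty from the GP dynamics model.
    mu, var = dynamics_gp.predict(state, action)
    # Upper bound on the Lyapunov value over the confidence region around mu:
    # value at the mean plus the Lipschitz constant times the confidence radius.
    v_next_worst = v(mu) + lipschitz_v * beta * float(np.sqrt(var).sum())
    # Certified if even the worst plausible model still goes downhill.
    return v_next_worst < v(state)
```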
And here's an illustration on a simple 1D toy example. The state space is this horizontal line, and the action space is on the vertical axis, so this curve up here is my policy: it tells me which action I apply in each particular state. Then I have a very simple Lyapunov function with a minimum right here, and from prior knowledge I know this initial safe set, along with an initial safe policy. Now I have uncertainty about my model: locally, here, I have very little uncertainty, but as I move further away in the state space, the variance of my model, a simple GP model in this case, increases. What you can do now is start collecting data within the current safe set. As you do that, you'll see the background color shrink — we're also doing policy optimization at the same time, but that's not so important right now — and as the uncertainty shrinks, I can certify larger and larger areas of the state space as safe. So this is one way to map uncertainty about your model directly to a notion of safety. And what's really neat about this is that from here you can use results from the bandit literature and actually give exploration guarantees that look roughly like this: you start out with an initial safe set, you explore within it to drive the model uncertainty down, based on that you can certify larger regions of the state space as safe, which opens up new avenues for exploration, and eventually you find some notion of the near-optimal, or largest possible, safe set for the given Lyapunov function. It's one of the only analyses in the nonlinear setting that actually guarantees you can safely explore and eventually find some notion of a large safe set without ever violating the safety constraint. But it's a fairly theoretical construction, and it relies purely on uncertainty sampling. On the more empirical side, people have tried to bring these Bayesian optimization ideas, SafeOpt-style, into the model-based setting. For example, here's one paper that trades off linearly, with a trade-off factor ρ, between maximizing reward and learning about the model, so it combines reward maximization with pure uncertainty sampling: while staying robustly safe, you maximize reward on the one hand and, on the other, still do some system identification, learning about the model to identify new promising avenues for exploration. A rough sketch of that objective follows below.
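Here's that trade-off as a hedged sketch; rollout, disagreement, and rho are illustrative stand-ins (the rollout is assumed to come from the learned model ensemble, and disagreement measures epistemic uncertainty, for example variance across ensemble predictions). The outer loop would maximize this objective only over robustly safe policies.

```python
def exploration_objective(rollout, disagreement, policy, rho):
    # Roll the policy out under the learned model(s).
    states, actions, total_reward = rollout(policy)
    # Information bonus: accumulated model disagreement along the trajectory.
    info_bonus = sum(disagreement(s, a) for s, a in zip(states, actions))
    # Linear trade-off between performance and learning about the model.
    return total_reward + rho * info_bonus
```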
I don't think this is the end of the story, though. This is largely where the story ends right now, but looking ahead, it would be really nice to go one step further and find methods that actively learn about that top path: optimistically, the trajectory up there might be safe, but we need a mechanism to actively gather the information that confirms it. I think we don't have the final answer for this yet, so there are genuinely interesting open problems here, especially on the more theoretical side, for people to work on in the future. And so, where can we actually go from here? Yes, question. Just a short question on what you just said: wouldn't the pure exploration you showed before actually find this kind of path? Yes, it would. It's just not very satisfying, because what pure exploration does is learn about the model everywhere, and one thing that's nice about reinforcement learning is that we focus only on what's relevant for solving the particular task. Take the half-cheetah: to solve the half-cheetah task you only need to focus on running forward, whereas pure exploration will run backwards, jump up and down, do whatever the model can somehow do. So it will be much less data-efficient, because you need to learn about everything, while reinforcement learning works well precisely because you focus on what's relevant. Before people looked at state distributions, they started with: given a particular model, how can I learn value functions? They would discretize the entire state space and try to learn the value function globally, everywhere. And even with function approximation thrown into the mix, that never worked as well as looking at state distributions and learning value functions locally. So by focusing on what's necessary for the task, we gained a lot, both empirically and from the theory side. Focused exploration is much more attractive than global exploration, especially if you care about data efficiency. Thank you. All right, then let me quickly wrap up, pretty much on time this time. So, safe reinforcement learning. One thing that's really interesting about this problem is that it brings together a lot of different literatures: classical control theory has something to say about it, and so do formal methods, statistics, decision theory, and machine learning. They all play together, and you get really interesting exploration problems, where on top of the difficulties you normally have in model-based reinforcement learning there are some extra bits that make it extra interesting. And I think there are still interesting directions. The theory and practice of how to trade off safety and performance is not completely clear: people have tried combining existing model-based algorithms with safety constraints, but that doesn't really carry over the guarantees, so it would be nice to know how to do this properly in practice. Model learning itself: given that most of the methods I just showed were model-based, model learning is obviously a crucial part of safe reinforcement learning as well. And then a better understanding of when safe exploration is actually easy, so that these theoretical worst-case scenarios don't happen, would also be super valuable: when is it enough to combine normal exploration algorithms with worst-case safety guarantees and still converge? Because empirically, at least for these MuJoCo environments, the problems we construct seem to be easy enough. Okay, and with that let me wrap up. Hopefully I convinced you in the beginning that for real-world applications we really care about data efficiency, because collecting data is expensive, and that for some problems we also care about safety. A good way to deal with these kinds of problems is to have some notion of epistemic uncertainty, and I think model-based methods in particular are a very natural candidate for safe exploration; there are also good ways to model epistemic uncertainty in these models. As we've seen in the first talk, and also in this one, even for POMDPs there are methods that can model epistemic uncertainty. This is what I just said, right?
So there are practical methods: if you care about epistemic uncertainty, you can use them. For that paper even the source code is online, so in principle you can just run these methods. But it's still early days; one could probably squeeze out even more performance there, which I think is super exciting. Formal guarantees for safe exploration, on the other hand, are tricky, and right now they only exist if you really do this kind of global exploration. From that it follows, I think, that the next step is to look at methods that actively learn about safety in the model-based setting as well; that part I didn't talk about today. All right, so overall: I personally really like this safe reinforcement learning setting, because it's not only a twist on the previous setting but also creates really interesting additional challenges. And in particular, now that we actually have practical methods that also work in the deep setting, I think this makes it an even more exciting space to work in, and I hope some of you will pick this up in the future. And I'm perfectly on time, so I'm happy to take some questions, if there are any. Thank you. So, is there any existing work on how to do safety in these non-stationary problems? Many of the real-world problems seem to be non-stationary, where even the safety criterion can change dynamically; is there any comment you can make, very briefly? So, in general, in the model-based setting: first of all, no, I don't think people have looked at this yet. We're just starting to get a handle on the stationary case, so adding non-stationarity to the mix seems like a good next step. That being said, in the model-based setting it seems very natural to then also think about learning models that can adapt and change, so the key challenge towards doing this might actually lie more on the model-learning side than on the safe-algorithm side. But I think it's a cool direction, so go for it. Let's thank the speaker again.