Some of you might recognize the first speaker, because he also spoke at the last edition of RSS. Herke van Hoof is an assistant professor at the University of Amsterdam, and generally speaking he is interested in making reinforcement learning more data efficient by leveraging a variety of structural assumptions. Today he's going to talk about TD methods with function approximation. Herke, the floor is yours.

All right, thank you for the introduction. Please take a seat. So today I want to talk about temporal difference methods with function approximation. Maybe before we dive into that, and maybe this is super obvious to most of you, I thought it was important to start from the motivation. Oh, I didn't put in a password, so I don't know what it is. Okay, we'll sort it out as soon as possible.

So the first question is: why do we want to do approximation anyway? Yesterday with Olivier you probably looked at tabular methods: you build up a table, and for every state-action pair you learn and store a Q-value.

Sorry to interrupt, the password for the slides is Pasta Naga. I'll write it on the screen and post it in the Slack as well. There we go, a little bit bigger, so it works at least. Did we get that? All right.

But in many domains you have a very large number of states, and if that number gets very big, you might not even be able to store the whole table in memory, and obviously that's a problem. Or even if you can fit it in memory, you still have to fill this big table up with data, and your data requirement might get too high before you can independently learn a good value for each of the cells. Of course, if the state space is continuous, it's especially problematic, because then you have infinitely many possible states.

As a couple of examples, you have the famous papers from the last couple of years: Atari games, where the state is given by the screen and you can have an enormous number of different pixel configurations; the game of Go, where you again have millions of possible boards you can create by placing the black and white stones in different positions; and robotics or control, where you often have continuous values, for example these little carts that can drive left and right, with a position and a velocity and so on, and all of these values are continuous.

For all of these systems, at least to some extent, you hope that small changes in the state don't change the value, or the action you want to take, by that much, so that you can generalize: if you have two very similar screens in an Atari game, or two very similar velocities in a physical system, maybe you can get away with taking very similar actions in those cases.
So essentially, with these two different aims, you want to represent the value function in a compact way, without having to learn too many different parameters or values, but also in a way that allows generalizing experience to nearby states. Those are, of course, very similar to the criteria you have in any kind of supervised learning. On the one hand, you could take a very flexible function class, like a huge neural network, but if the function class is too general, you might overfit. And if you have a very inflexible function class, maybe just a linear function, it might not be expressive enough. So you need to find a good balance. No matter what you choose, you will often not be able to represent the true value function exactly: because you don't want to go all the way to something extremely flexible, as soon as you're somewhere in between, you might not be able to represent the value function exactly, so you have to find a good approximation.

And reinforcement learning with function approximation actually has some unexpected subtleties. You cannot just open the deep learning toolbox and apply it to RL problems; you'll find that it doesn't always work the way you would find intuitive, and we'll explore that today. To hint at where some of those subtleties lie, we can look at some common misconceptions. I tried to find some examples: a paper from a recent ICML said "DQN minimizes the temporal difference error", and another paper from RSS said "Q-learning minimizes the Bellman error". Both of these statements are not true, and at the end of today's lecture you'll know why, and also what the correct statement would be.

Our plan for today is to cover the following topics. We'll start with function approximation and semi-gradient TD learning, which is the first algorithm for learning with approximation that you typically see. Then we'll see what kind of problems you can run into, the so-called deadly triad. Then we'll dive a little deeper into these errors and value function geometry. And finally we'll look at alternatives to semi-gradient methods: how can we solve those problems? I have to say my own research interest is a little bit different, so I'm teaching this as a kind of, let's say, advanced basics course; I myself usually look more at actor-critic methods, hierarchical structures, and so on. But I'll of course try to answer any question you have as well as I can. The book is, I guess you all know it, the Sutton and Barto book, available for free online. If you want to read up on more details after the lecture, that's also a very good resource to start with.

So we want to step away from the tabular method, where for every state-action pair we learn a completely separate, independent Q-value, or value, or whatever. Instead, we will learn the parameters of some value function: we now have a function of both the parameters that we choose and the state that we are in. Maybe the simplest way to think about this, the first example, is state aggregation. Here we group states together, and rather than having one value for each state independently, we assign the same value to the whole group of states.
So here I showed some system which has, for example, 200 states. Then we could say: okay, we group all of these together where it's blue, all of these together where it's green, and so on. This gives us a kind of basis, and then we can ask what kind of functions we can create out of those basis functions. This, for example, would be one value function that we can represent using that basis. So you get these stepwise functions where a whole group of states is forced to take on the same value. And of course, in reality, the value might not be the same for all of these states, so this is an approximation of the true value function. Makes sense so far? Don't hesitate to ask questions if anything comes up during the lecture.

Now, you can of course make this a little more flexible. Rather than this piecewise constant approximation, we can go for linear function approximation. In that case, we typically specify some transformation: a couple of functions of the state, features of the state, let's say. For example, x1 could be the x-coordinate and x2 the y-coordinate of the agent, or it could be the color of a specific pixel, or whatever. Then, for linear function approximation, we approximate the true value function (the little hat denotes the approximation) by a linear function: a vector of parameters w in an inner product with that transformation of the state, which we can of course also write out as a sum. And we'll look for the best value function within this function class. One thing to note is that this value function is linear in w, but not necessarily in the state itself, because we can have a nonlinear transformation x.

Then the question is, of course, how you should choose this transformation. This is completely free, so you can pick anything you like based on the kind of problem and based on what you think are the important features of this particular problem. If you don't have an a priori feeling for which aspects of the problem should be represented as one of these functions x, you can take some generic choices. You can take a polynomial basis, the same as in polynomial regression: the functions x1, x2, and so on are just a linear function, then a square, then a third power, and so on, and you go as far as you think you'll need, or you try out a couple of different orders. You can also go for a Fourier basis, where maybe the first property of your problem is the x-coordinate and the second the y-coordinate, and you apply a kind of Fourier transformation, taking sines and cosines along different directions. Or you can take radial basis functions, where you specify a couple of centers and the activation of each feature is based on the distance from its center. And, as I mentioned, you can also pick anything else you like which is task-specific: if you know that in this maze I have my agent and an enemy and those two things are really important, maybe I should include the x and y positions of both of them, or something like this.
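To make the linear form concrete, here is a minimal sketch (my own illustration, not code from the lecture) of such a feature map and value estimate; the feature choices, centers, and widths are arbitrary placeholder values.

```python
import numpy as np

# A minimal sketch of a linear value function v_hat(s, w) = w . x(s)
# with two generic feature choices, for a scalar state s.

def poly_features(s, order=3):
    """Polynomial basis: [1, s, s^2, ..., s^order]."""
    return np.array([s ** k for k in range(order + 1)])

def rbf_features(s, centers, width=0.5):
    """Radial basis features: activation depends on distance to each center."""
    return np.exp(-((s - np.asarray(centers)) ** 2) / (2 * width ** 2))

def v_hat(s, w, features):
    """Linear value estimate: inner product of weights and state features."""
    return np.dot(w, features(s))

# Example: a scalar state in [0, 1] with 5 RBF centers.
centers = np.linspace(0.0, 1.0, 5)
features = lambda s: rbf_features(s, centers)
w = np.zeros(len(centers))
print(v_hat(0.3, w, features))  # 0.0 before any learning
```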
Of course, the state aggregation we just saw on the previous slide is a special case of this linear approximation: if I choose x1, x2, and so on to be these little step functions, indicator features that are on exactly when we are in the corresponding group of states, then we can represent that value function in this way as well.

Okay, so those are linear approximations, and of course we can do nonlinear function approximation, which is essentially everything else. This can very popularly be some kind of neural network, like a feed-forward network, so your multi-layer perceptron, or it can be an LSTM or a convnet or whatever you like. We'll see some advantages and disadvantages later on. We'll mostly focus on the linear case here, just because the theoretical properties are more straightforward, and we'll already see that even this much simpler case is maybe not so simple.

Now, before going into how we can actually learn these types of value functions, let's quickly revisit what we saw before; I hope this is something you saw with Olivier. If we do simple tabular TD(0), we maintain a value for every state, and every time we get a new transition in the environment, we update that value with a learning rate, so we have a learning rate there. And we get some term which you might know as the TD error. If you look a little closer at this update, you can distinguish two parts. On the one hand we have the current value at that state according to the learned value function, which shows up here and here. And then there is another term, this r plus gamma times the value of the next state, which you could say is what we would like the current value to move towards, so it acts as a kind of target.

With a function approximator, the intuition is that we want to do roughly the same thing. But now we can't just say we'll update the value for this particular state; we have to update the parameters w, which will affect the values of many different states. We still want to move in a direction that makes this TD term smaller, closer to zero, but now we do this by multiplying that whole term by the gradient of the value function. What this gradient indicates is: if I change w, does this value become larger or smaller? So if the TD term is positive, so the target is bigger than the current value, then we change w in the direction that makes the value at this state bigger. And if it is negative, so the current value is too large, we get a negative number multiplied by the gradient, so we move w in the direction that makes the value smaller. Intuitively, that makes some sense, and this is a very popular method: DQN, for example, which we'll see in a lot of detail this afternoon and in the next talk, is based on this type of update.

But there are also some complaints, let's say, that you could level against this method. The first is that it kind of looks like this might be the gradient of some error function, maybe the inner derivative of an error term based on this TD error. But actually, it is not the gradient of any well-known error term, because the TD error has w in two places, both here and here, while in this gradient term only the last part shows up.
So it doesn't seem to be a true gradient method. Just to clarify that a little bit, I wrote out what it would look like if we took something like one half times the squared TD error as a loss function and wanted to move along the negative gradient of that. We can write out the gradient of that term: we have the square, so the outer derivative is just the TD error itself, repeated, times two, so the one half disappears. Then we take the inner derivative: which terms here depend on w? Well, both this one and that one, so we get two terms in the inner derivative. We see that up to here it's the same as the semi-gradient method, but there is this additional term, the gradient of the target, which does not appear in the semi-gradient update.

Which is not to say that this would be a good method to implement; I don't want to argue that. I just want to say: look, that update rule there is not the gradient of something like the squared TD error. And because of that, this method is called a semi-gradient method: it's not quite a gradient method. What's also important to know is that this semi-gradient method is not going to minimize the TD error. If you look at where it converges, it is not converging to the value of w where the TD error is smallest.

Yeah? What was the question? Why is this not a good update, the actual gradient of the TD error? We'll get back to it, a couple of slides further on. There's also a question here... okay, the same one. Actually, those are the last two points on the slide: the semi-gradient method doesn't minimize the TD error, but we'll also see that minimizing the TD error is not in general what we want anyway, so we'll go into detail on that later on.

Okay, but let's for a moment stick with this learning rule. If we want to implement this as an algorithm, what would that look like? Essentially, we have all the ingredients right here: we know w, we know r, we have the current weights, so we can evaluate this term and that term. The only place where we need to do a little bit of effort is this gradient of the value function at the current value of the parameters. For linear function approximation, that term is extremely easy: remember from the earlier slide that the value is just the inner product of w and x, so the gradient with respect to w is just x. That's very nice and easy. If we have some nonlinear function approximation, the calculation gets a bit more involved, but PyTorch or TensorFlow can take care of that for us: we plug it in and ask for a backward computation.
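Putting that together, here is a minimal sketch of the linear semi-gradient TD(0) update in Python; the environment and policy interfaces (env.reset, env.step, pi) are assumptions for illustration, not something defined in the lecture.

```python
import numpy as np

# A minimal sketch of semi-gradient TD(0) with linear function approximation,
# assuming an environment interface env.reset() / env.step(a) returning
# (next_state, reward, done), and a fixed policy pi(s).

def semi_gradient_td0(env, pi, features, n_weights, alpha=0.05, gamma=0.99,
                      n_episodes=100):
    w = np.zeros(n_weights)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(pi(s))
            x = features(s)
            v = np.dot(w, x)                          # current estimate v(s, w)
            target = r if done else r + gamma * np.dot(w, features(s_next))
            # Semi-gradient step: the TD error multiplies only grad_w v(s, w) = x;
            # the target is treated as a constant (no gradient through it).
            w += alpha * (target - v) * x
            s = s_next
    return w
```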
So we can now look, for a very simple toy example, at what kind of results we get out. Here they had a particular system, I think it was just a random walk: essentially you start in state 50, and every step you randomly go one step left or right, so you go back and forth a little bit, and if you finally end up at zero you get a reward of minus one, and if you end up at the other end, at 101 let's say, so if you exit the system on that side, you get a reward of plus one. So the true value function is roughly this linear slope; I don't know if it's exactly linear, but almost: the further you get towards either end of the chain, the closer you get to the associated reward there.

And you can look at what happens if we now use semi-gradient TD in a state aggregation scenario, where, just as we said before, we group the states together and force the algorithm to give the same value to each group of states. So there is one weight corresponding to all of these states together, one weight corresponding to those states together, and so on. Rather than learning 100 separate values, we're learning, I think, 10 different weights. (I don't know why there's this weird gap there.)

If you observe this graph, one thing is very clear: if I represent my value function with this kind of step function, I will never perfectly recover the red line, so we expect these artifacts where we're a little bit off at the corners. That's fine. But we also see something else: it seems that if I had to fit this stepped line to the red line myself, I could maybe do a better job than this. Here it's just too high, and all of these could have been a little lower and it would have been an overall better fit. So, yes, this is a property of this type of method: it does not converge to the solution that is closest to the true value function. Maybe that surprises you a little bit, and it raises some questions: does this always converge, and can I characterize the type of solution it converges to? That's something I want to look into a little bit.

So let's dive directly into the question of what it converges to. For linear value function approximation, we can look at that theoretically quite easily. We have these recursive updates, where every time the new weight vector is this function of the old weight vector, and how does this look with linear function approximation? Remember, at the end we had the gradient of v, and we said that this gradient, in the linear case, is just x. And what we want to know now is: if I just keep running this update over and over again, at which point is the expected new value the same as the old value? Because at that point we have converged: on expectation we are not going to change w any further. So I'll rewrite it a little bit: essentially I just pulled this x inside, and then I can factor out, and I've also pulled this w out. So this is just the TD update, and even at convergence, if we keep doing this update, we will see it jumping around a little bit because the transitions are stochastic. So we ask: what is the expected update going to be?
So we take the expected value of the new weights given the old weights, and we also take the expected value of the right-hand side, pulled in as close as possible to the random variables. We see that, on expectation, the new weight vector is the old weight vector plus alpha times this vector here, which I'll call b, minus this matrix here, which I'll call A, times the old weights. Does this make sense?

Yeah, good question. So this is following the convention of the Sutton and Barto book; there are different indexing conventions in RL. Essentially, at the start of a time step we have the state, for example x_t; then we take an action, then we get the reward, which is called R_{t+1}, and we get the new state x_{t+1}. So R_{t+1} is the reward you observe just before transitioning into that next state. And on every transition we also update the weight vector; in this standard TD-learning or Q-learning setup, the w's are indexed by the same t. So when we are at step 100 of the episode, we are at x_100 and we also have the 100th update to w. Of course, if you're in a different learning paradigm where you batch updates or do experience replay or something like that, you won't have the same indices for the two.

Okay, and we want to know when this whole thing is going to be equal to w: when is the new value of w equal to the old value, because at that point we've converged. We'll call that point w_TD. With a little bit of linear algebra, if you solve this equation, you see it is just a linear equation in w, so it's very easy to solve: w_TD is just A inverse times b. We can do the math, it's also in the Sutton and Barto book; I didn't want to go through it here, but it's not very complicated. So we now know there is a point called w_TD which we can compute if we know the expectations of these quantities. Of course, for an unknown system, if I just drop you into an Atari game, you won't know these things, but for a simple system, if you know the transition probabilities and so on, you can calculate them, and that's where TD is supposed to converge to.

We won't do any proofs, but I'll just tell you: on-policy semi-gradient TD (semi-gradient TD is the algorithm from before, and on-policy just means that we are trying to learn the value function of the policy that is also picking the actions, so it is also the behavior policy) with linearly independent features is guaranteed to converge to that fixed point, the TD fixed point. So that is, let's say, a special point that semi-gradient TD converges to.

Now I can very quickly introduce another algorithm: we could also try to find the same fixed point by estimating this matrix A and this vector b directly. Every time we have a transition, we can calculate, for that particular transition, this term and that term, and we average those over all of the time steps that we have, so the expectation just becomes an empirical average. Then we just do the computation w = A inverse times b. That's another way to find the fixed point, and it's called LSTD, least squares temporal difference. That's maybe just interesting to know.
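For illustration, here is a minimal LSTD sketch, under the assumption that we have collected a batch of transitions under the policy we want to evaluate; the data format and the small regularization term are my own placeholder choices, not something from the lecture.

```python
import numpy as np

# A minimal LSTD sketch for linear features, given a list of transitions
# (s, r, s_next, done) collected under the evaluated policy.

def lstd(transitions, features, n_weights, gamma=0.99, reg=1e-6):
    A = np.zeros((n_weights, n_weights))
    b = np.zeros(n_weights)
    for s, r, s_next, done in transitions:
        x = features(s)
        x_next = np.zeros(n_weights) if done else features(s_next)
        # Empirical averages of  A = E[x (x - gamma x')^T]  and  b = E[r x].
        A += np.outer(x, x - gamma * x_next)
        b += r * x
    A /= len(transitions)
    b /= len(transitions)
    # Small ridge term keeps A invertible when data is scarce.
    return np.linalg.solve(A + reg * np.eye(n_weights), b)
```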
Right. So far we looked at learning the V function, and I guess you covered this already yesterday: you can typically either learn a V function or a Q function. If you only want to know how good a particular state is, a V function is fine. But if you also want to do control, so to pick the optimal action, then learning Q is easier, because with a Q function you can say: I am in this state, I plug in all the possible actions, and I take the action where the Q value is maximal. And you can do essentially exactly the same semi-gradient update we discussed for value functions for Q. I wrote it down here, but there aren't any particular surprises: everywhere we had a V, we just put in the Q function. This is a SARSA-like method, where the update depends not only on A_t, the action we are about to choose now, but also on the action we are going to choose in the next time step. This is the equation you can use for episodic tasks; for continuing tasks you need a small modification, which I don't want to go into now, but it's also in the Sutton and Barto book.

So how could we write an algorithm for learning control, for obtaining the optimal policy? It would look something like this. We have a current policy pi; we select an action according to that policy; then we use this equation to improve the estimate of Q_pi; and then we improve the policy, setting it to a soft approximation of the greedy policy, for example an epsilon-greedy policy with respect to the Q function we are estimating.

We can look at that in algorithm form, also taken from the Sutton and Barto book. We start with some value function parameterization, so I have to choose the function class, the step size, and some convergence criterion, and I initialize the weights of the Q function, the state-action value function, arbitrarily; we can start at zero or at some random values, it doesn't really matter. Then we generate episodes in the world. In each episode you start with some state-action pair: S is typically given by the world, A could be chosen by epsilon-greedy, for example. Then on every step of the episode we execute that action in the world and observe the reward and the next state. If the next state is terminal, we immediately execute the update and go to the next episode; it's important that in this case there is no Q of the next state in the update, because we've just terminated. If the next state was not terminal, then, and this is a little bit special if you're implementing something SARSA-like, you already have to commit to the next action you're going to execute, because you need it in your update; so we already pick A prime, the action we will execute in the next time step, in this epsilon-greedy manner. Then we execute the semi-gradient update, and then we just do some bookkeeping: we set the current state and action to the S prime and A prime that we have already committed to, and in the next iteration of the loop the first thing we do is actually execute that action in the world.

One more thing to note: we said there is a step where we improve the policy, setting it to a soft approximation of the greedy policy under the Q function we're learning, and you don't really see that step explicitly in this algorithm. But of course, since I'm choosing A as a function of the current Q function, selecting it from the epsilon-greedy policy of Q, you could say that implicitly, every time, we're selecting a policy which is the epsilon-greedy policy under that Q function; it's just hidden inside that action-selection step.
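As a concrete sketch of that loop, here is a minimal episodic semi-gradient SARSA implementation; the environment interface and the state-action feature function are placeholder assumptions of mine, not code shown in the lecture.

```python
import numpy as np

# A minimal sketch of episodic semi-gradient SARSA with linear features over
# (state, action) pairs, assuming env.reset() / env.step(a) -> (s_next, r, done).

def epsilon_greedy(w, features, s, actions, eps):
    if np.random.rand() < eps:
        return np.random.choice(actions)
    q = [np.dot(w, features(s, a)) for a in actions]
    return actions[int(np.argmax(q))]

def semi_gradient_sarsa(env, features, n_weights, actions,
                        alpha=0.1, gamma=0.99, eps=0.1, n_episodes=500):
    w = np.zeros(n_weights)
    for _ in range(n_episodes):
        s = env.reset()
        a = epsilon_greedy(w, features, s, actions, eps)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            x = features(s, a)
            if done:
                # Terminal transition: no Q(s', a') term in the target.
                w += alpha * (r - np.dot(w, x)) * x
            else:
                # Commit to the next action already, SARSA-style.
                a_next = epsilon_greedy(w, features, s_next, actions, eps)
                target = r + gamma * np.dot(w, features(s_next, a_next))
                w += alpha * (target - np.dot(w, x)) * x
                s, a = s_next, a_next
    return w
```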
Okay, so far so good. This was all on-policy: the policy used to select actions is the same policy that I'm updating. But as you might already know, off-policy learning is usually very important. Maybe you want to take explorative actions in the real world, while the policy you want to obtain is the optimal policy, a policy that doesn't take exploratory actions. Or maybe you have data that was generated by a policy you ran before, and you are not really free to do exploration in the world now, so you need to learn offline. Then we have the problem that the algorithm assumes it gets transitions generated according to the target policy, but it is actually going to get transitions generated by some other policy, the behavior policy.

To correct for that difference, we always have these importance weights, which contain the fraction of the target policy probability over the behavior policy probability. What this does: if there is an action that is, for example, taken very frequently by the behavior policy but would be taken very seldomly by the target policy, then this weight is going to be very small, smaller than one. We are getting this sample more often than we should, so we correct for that by giving it a low weight. And vice versa: if there is an action that the behavior policy would rarely take but the target policy would take a lot, then every time we do run into one of those samples, we give it a high weight to compensate. In principle, we can correct these updates just by plugging that importance weight in right after the learning rate; essentially you could say you make the learning rate a bit higher, so you take a bigger step, whenever you get one of those samples that is actually a little bit too rare.

So let's see how this type of method behaves. I drew a little system here; it's very simple, there are two states. My function approximator, a linear one, works as follows: I have a single weight w, maybe 10 at the beginning; in this state the value is w, and in that state the value is 2 times w. The policy is just: whenever I'm here I go there, whenever I'm there I go back here, and I never get any reward. So logically, the only value function consistent with that system is the value function that gives zero value everywhere.
Because the two values obviously need to be the same, and for any w that is not zero, I would have different values there. So if I start with some nonzero w, what we would want is for the value to slowly be driven back towards zero. So what happens now? If I make a transition from left to right, the value at the state I'm coming from is w, and my target value will be 2 times w; I'm just setting the discount factor to 1 to make it simple. If I fill in the semi-gradient learning rule we had, the new weight is the old weight plus the TD error times the gradient of the value function, and I can fill all of that in; I could also put the 10 in. What we see is that we actually make this w a little bit bigger, because the TD error, 2w minus w, is positive, the gradient here is positive one, and w was also a positive number. So going from value w towards the target of 2w, we make w a little higher; it's actually going a bit in the wrong direction.

But then, equally often as we go from left to right, we go from right back to left. What happens then? It's exactly the opposite: the current value is 2w and the target value is w, so we get w minus 2w, times the gradient of the value function where we are now, and the gradient of 2w with respect to w is just 2. So we get exactly the negative of the step we had before, but multiplied by 2. Whenever we make the left-to-right transition we make w a little higher, but whenever we go back we make it lower by a bigger step. So on expectation we go down, which is exactly what we want: it brings w down to zero, and the TD error on average tends to shrink. That's very good, it's doing exactly what we want. But this was on-policy.

So now let's see what happens in the off-policy case. Say the behavior policy is going back and forth, but the target policy, the one we actually want to learn the value function of, stays in that second state. Whenever we go from left to right, it's exactly the same as before: we get the semi-gradient rule, but now with this extra importance weight. Since from left to right the behavior policy and the target policy are the same, both policies take that transition, that importance weight is just going to be about one, so exactly the same thing happens as before: w gets a little bit higher. But now, for the transitions where we go from right back to left under the behavior policy, we have the same update as before, multiplied by the importance weight, and that importance weight is going to be zero: the behavior policy goes to the left with probability one, but the target policy stays there with probability one, so it never actually takes that action, and we have to correct for that. So we take that little step which goes in the wrong direction, but we never take the correcting step that goes in the right direction.
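Just to make that concrete, here is a tiny simulation of this two-state example; it is my own sketch, with an arbitrary starting weight and step size rather than values from the slides, but the updates are exactly the importance-weighted semi-gradient steps described above.

```python
import numpy as np

# Two-state example: values are w (left) and 2w (right), gamma = 1, no rewards.
# We alternate the two transitions and apply the semi-gradient update with
# importance weights rho for each direction.

def run(rho_left_to_right, rho_right_to_left, w=10.0, alpha=0.1, steps=200):
    for t in range(steps):
        if t % 2 == 0:   # left -> right: v = w, target = 2w, grad_w v = 1
            w += alpha * rho_left_to_right * (2 * w - w) * 1.0
        else:            # right -> left: v = 2w, target = w, grad_w v = 2
            w += alpha * rho_right_to_left * (w - 2 * w) * 2.0
    return w

print(run(1.0, 1.0))  # on-policy: both corrections applied, w shrinks toward 0
print(run(1.0, 0.0))  # off-policy: correcting step never taken, w blows up
```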
Now, this is of course a somewhat rough example, and it isn't quite correct, because these importance weights don't actually work if what the target policy does has zero probability under the behavior policy; no matter what you do, you can't correct for that. So this example is not 100% valid, but it starts raising a question: when we have importance weights that are much, much smaller than one, maybe the TD error can keep increasing indefinitely; it doesn't go down the way we would like. That was more to build intuition. To give a completely correct example, we can look at Baird's counterexample from the book, which is constructed in a slightly more complicated way; I will not walk through all the details. But if we look at how the weights in this example evolve, we can see what's happening: this is clearly not converging. The weights actually start changing more and more as we do more learning steps; on one axis is the number of learning steps and on the other the values of the weights, and they seem to be on a kind of exponential trajectory, let's say.

Now, you could ask a couple of questions about this counterexample. Why do these errors keep getting bigger and bigger? Maybe it has to do with the true value function not being exactly representable by the function class used for approximation; maybe the problem arises because that class doesn't contain the true value function. Well, that is not the case, because I can make the function class in this example a little richer, so that even the true value function is representable, and we still don't find it. Maybe it's due to random effects? That can also be excluded, because there are deterministic versions of this example that still diverge. And the last thing you could say is that maybe it has to do with dependent features: the value here and the value there both rely on this shared weight w_8, so maybe that's to blame. Well, that can also be excluded.

So what is it then, essentially? In the past, people have done a lot of work to drill down on this, and in the end the conclusion is that this divergence can occur when three factors all come together: you have function approximation, you have semi-gradient bootstrapping, so the semi-gradient method we've been talking about so far, and you have off-policy training. As soon as you have those three together, you can no longer guarantee that your learning process converges. And that's not something that's very easily solved.

Maybe the first question you could ask is: could we do without any one of those three ingredients? Let's look at all of them. Can we get rid of function approximation? Well, at the start of this lecture we said that function approximation is actually very important: we often have too many states to practically learn each of them independently, and without function approximation we cannot scale to large or continuous domains, domains with a large or continuous state space. We could also say: let's get rid of bootstrapping. I guess yesterday Olivier also talked about TD methods, like Q-learning and SARSA, which learn from single transitions, versus Monte Carlo methods, which learn from an entire rollout: from a certain state, you look at the whole future return until the end of the episode.
But those Monte Carlo methods, while they don't have this divergence problem, tend to be much, much slower to learn, so we don't really want to go back to them. And then we can ask: maybe this off-policy learning is the problematic part, maybe we should get rid of it. We could just say: we'll always learn the Q function of the current behavior policy. But that's not quite what we want. If you think a little bit about the future, maybe what you'd like to do is learn all kinds of tasks: you have a robot and you want it to be able to run and to walk and to stand. While the robot is trying to run, it learns something about balance, or about not colliding with walls, and that is also valuable experience for when it's trying to walk or jump or stand. If you can learn off-policy, you can take the data collected while the running policy is active and use that same data to learn to walk, for example.

And with off-policy learning, you can learn from all kinds of data. Say I have recorded data, for example from someone piloting the robot with a joystick. Or I have old data, generated by a very old policy; maybe I can still get something out of it. Or maybe I have some safety-critical situation where I have a controller that has been thoroughly vetted, that people are comfortable with, and we have generated millions of data points using this old controller. Can you now use reinforcement learning to give us a new controller? But before we apply that new policy to the safety-critical system, we first want to be able to vet it, to make sure it really does what it's supposed to do. So you only have the data from the old policy, and you can't do exploration with the new policy you're learning. In all of those cases, you really do want a method that can handle off-policy data, data coming from any of these sources, and use it to learn a new, optimal policy.

Right. Any questions so far? Yes, a question over there.

Hello. Sorry. I just wanted to know if you still have this problem if you go from model-free to model-based methods, because in my mind, if your behavior policy has a positive probability of sampling every state-action pair, you don't have this problem in model-based methods, but I'm not sure if that's true.

So, thinking about model-based methods: there are a lot of different model-based approaches. Some of them learn a model and then train the policy on that model, so the model is just a proxy for real interaction with the system. Then nothing really changes: instead of the real system we have a simulated system, and the same things can go wrong, so this divergence could also happen with a learned model. On the other hand, and this depends on a lot of details, you could also say: in that model I can now generate a lot of on-policy experience, and if you're on-policy, this is not a problem.
Whereas on the real system, you might say it's much more critical to minimize the amount of interaction I have with the system, so I want to get the maximum out of all the data I have lying around, and I really do want to be off-policy. In that sense you could maybe get around it by saying: if I have a model and can get effectively unlimited data anyway, I can get as much on-policy data as I want, so why would I even do off-policy learning? But it depends, because learning a model and then using a model-free technique on the learned model, as if it were the real system, is just one approach within the model-based toolkit, and for other methods I'm not exactly sure how much this applies or not. All right.

Okay. So we said we have this problem when those three things come together, we can have divergence, and it's not so easy to just take one of those three things out; at least that doesn't seem to be a path towards having reinforcement learning work at scale in the applications we care about. So maybe we can still do bootstrapping but just get away from this particular semi-gradient method, which seems to be the problematic part. Maybe we have to take a step back. At the beginning of this lecture I said: we have tabular methods that go in the direction of reducing the TD error, and if we throw in the gradient of the current value function we get something that we call the semi-gradient method. Maybe we should step back and think: what error function do we actually want to minimize, and can we then define a, let's say, more robust algorithm that actually minimizes that error?

So this semi-gradient TD method might diverge in the off-policy setting, and we already said it is not really a gradient method. Gradient descent methods typically come with a lot of guarantees, so maybe if we can find a nice loss function and then define a true gradient method on it, that will be more stable, or we can maybe guarantee that it converges in a wider range of settings. And now we have the problem of saying: okay, but what error should we actually minimize?
If you know a little bit about reinforcement learning, you know this can quite quickly get confusing: you have the TD error, you have the Bellman error, and you can define something called the value error, and these things tend to look rather similar to each other. So what should we actually take?

We have the TD error, which is the one we've seen a lot: for a single transition, it's just the difference between the current estimate of the value and this bootstrapped target. What we can define as a potential loss function is what we call the mean squared TD error; we write TD with a bar above it to indicate that it is a mean. What you do is: you have this TD error at every time step, you square it, you take the expectation over actions and next states, and then you take another expectation over states. For that, of course, you have to choose a state distribution, some measure on the states, and the measure we tend to find useful is the visitation distribution: the distribution over states you would get if, at a random point while the policy is interacting with the world, you asked which state it is in right now. That's this mu. So that's one possibility.

You can also look at the mean squared Bellman error. The difference is a bit subtle, but the Bellman error is the expected TD error over all possible outcomes: I take the expectation of the TD error over all actions and next states, then I square that whole thing, and then I take the average over states. We'll see an example later on where this difference becomes clearer, but mathematically the difference is really whether the square is inside the expectation or outside the expectation.

Then there is another thing called the mean squared projected Bellman error. The Bellman error essentially says what the difference is between the expected target and my current value, but maybe that target value is not even representable by my function approximator, say by the linear function I'm trying to fit, so maybe there isn't any representable value function where that error is zero. So what if we only look at the component of that error that is actually in the space of functions I can represent? Then we get the mean squared projected Bellman error, where I put in this Pi, which is a projection operator that takes the error and projects it down onto the plane of representable functions. I have a visualization later on that hopefully makes this a little clearer. Anyway, what's important at this point is that it's not obvious which error function we should take, or even how these errors relate to each other.
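Written out (my own transcription of the standard definitions in Sutton and Barto's notation, with mu the visitation distribution, B_pi the Bellman operator, and Pi the projection onto the representable subspace), the three objectives are:

```latex
\begin{align*}
\overline{\mathrm{TDE}}(\mathbf{w}) &= \sum_{s} \mu(s)\,
  \mathbb{E}\!\left[\big(R_{t+1} + \gamma \hat v(S_{t+1},\mathbf{w})
  - \hat v(S_t,\mathbf{w})\big)^2 \,\middle|\, S_t = s\right] \\
\overline{\mathrm{BE}}(\mathbf{w}) &= \sum_{s} \mu(s)\,
  \Big(\mathbb{E}\!\left[R_{t+1} + \gamma \hat v(S_{t+1},\mathbf{w})
  \,\middle|\, S_t = s\right] - \hat v(s,\mathbf{w})\Big)^2 \\
\overline{\mathrm{PBE}}(\mathbf{w}) &= \big\| \Pi\big(B_\pi \hat v_{\mathbf{w}}
  - \hat v_{\mathbf{w}}\big) \big\|_{\mu}^{2}
\end{align*}
```

The only difference between the first two is whether the square sits inside or outside the expectation over actions and next states.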
To make that a little more intuitive, I drew the following little example. Say we have a tiny system with only three states. How can we look at the space of all possible value functions? Well, any value function I could imagine assigns one value to the first state, one value to the second state, and one value to the third state, so you can think of any value function as a point somewhere in this 3D space, where the axes are exactly those three values it has to assign.

Now, if we have function approximation, we are not free to choose any point in this three-dimensional space. Say I just have two parameters, w1 and w2, and those get multiplied by some features that I've defined on the states. For example, with two features, the features for s1 are (1, 0), for s2 they are (-0.5, 0), and for s3 they are (0, 1). Then which value functions can I represent? All value functions in the span of the two vectors (1, -0.5, 0) and (0, 0, 1). I can draw that: those are these two axes, so w1 gets multiplied by this vector and w2 gets multiplied by that vector, and these vectors span a 2D subspace. You have to interpret the picture a little bit three-dimensionally: the values at s1, s2, and s3 span a 3D space, and this blue plane is a kind of diagonal plane in that 3D space. Any value function that lies in the blue plane is something I can represent exactly, and any other value function in the 3D space I cannot represent exactly: I have to approximate it, and some of them are very close to something I can represent and some of them are not. Does this picture make sense? It's going to get a bit more complicated.

So now we can look at a slightly more complicated figure, which is based on the same principles; this figure comes from the Sutton and Barto book. Here you still have to imagine the 3D space containing all possible value functions, but now the two vectors spanning the representable subspace form the ground plane of the picture, so the plane that was diagonal before has been tilted to be the ground plane. (There is something wrong on the slide, but I guess you can fill that in.) Now, think of what learning looks like if we are not bound to a class of approximate functions, so if we just do tabular learning, like normal TD learning, without any approximation. From a certain point, so from a certain value function, we can look at the Bellman error vector: it points somewhere, and we can denote the point it takes us to as the Bellman operator applied to the old value function, so we end up at a new point in space. And this Bellman error vector, as we saw before, is the average of the TD errors, so if you do something like TD(0), on expectation you move in this direction: one time you go a bit more this way, one time a bit more that way, but on average you go there. If you keep doing that, you end up here, at the true value function. So if you don't have any approximation and you do TD(0) or something like it, you end up at v_pi; that's the ideal place to end up. If you do dynamic programming, you take exactly these steps in the direction of the Bellman error, so you really follow this path, whereas with TD(0) you stochastically wander a little bit around this path, but you also end up converging there.
Now, we are approximating, so we don't have the option to go there; everything we can do while approximating stays in this 2D plane at the bottom. The first question we can ask, without worrying about learning dynamics, is: what is the closest point in that 2D plane to the ideal value function? We can just project it straight down; since we're in a linear setting, this is a linear projection. Then we get a point in the 2D subspace with the minimum value error: the distance between this point and the true value function is smallest, so we make the smallest possible difference in the values that we assign to the different states. That could be one candidate for what we want to aim for. But of course we can't actually do that: we can't do the learning in the unrestricted space, because we are approximating, so we cannot even represent this point up here, it's not in the representable subspace. Finding that point and then projecting it down is not something we can do algorithmically, so we have to do something a little different.

What could we do instead? Well, on the last slide I already introduced a couple of error measures that are based more on TD concepts, on the difference between where I am and where I end up after a transition. One thing we could do is look at which point in the representable subspace has the smallest Bellman error. There is not generally a point where the Bellman error is zero, but there is some point where it is smallest, and that is actually not the same as the point with the minimum value error: at the minimum-value-error point there can still be quite a large Bellman error, and there may be some other point where it is smaller. So the point where the Bellman error is minimal is typically different from the point where the value error is minimal.

Now, and this is what the earlier question was about, maybe we should update according to the Bellman error. We could say: we want to move in the direction of this Bellman error vector, the same thing we did when we were doing TD(0) in the tabular case. But I cannot actually represent the point that takes me to, so at the end of every step I have to project back down into the representable subspace. So maybe the step I should take is this one here, the projected Bellman error, and maybe I should try to move in that direction. In the next step I would do the same: look at my Bellman error vector, project it down, and move in that direction, and I would keep learning like that, where at the end of every step I'm still in the representable subspace.

What do I mean exactly when I say projecting down? It's finding the closest representable alternative to the desired update, and when you say "closest" you always have to define what exactly you mean: the norm we use to define closeness is the norm weighted by the visitation frequency that we had before. If that doesn't mean much to you, it doesn't really matter; the notion that we find the closest representable point is good enough.
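As a small illustration of that projection, here is a sketch for the three-state example from before; the visitation distribution and the target vector are made-up numbers, and the projection matrix is the standard mu-weighted form X (X^T D X)^{-1} X^T D.

```python
import numpy as np

# Projection onto the representable subspace for the three-state example.
# X holds the per-state features, mu an assumed visitation distribution.

X = np.array([[1.0, 0.0],      # features of s1
              [-0.5, 0.0],     # features of s2
              [0.0, 1.0]])     # features of s3
mu = np.array([0.4, 0.4, 0.2])   # assumed visitation distribution
D = np.diag(mu)

# Pi = X (X^T D X)^{-1} X^T D : projection in the mu-weighted norm.
Pi = X @ np.linalg.inv(X.T @ D @ X) @ X.T @ D

v_target = np.array([1.0, 2.0, 3.0])   # some value function outside the subspace
v_projected = Pi @ v_target            # closest representable value function
print(v_projected)
```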
So we can now wonder: if we always go in the direction of the projected Bellman error, do we actually end up at that point we talked about before, the point where the Bellman error is minimal? We can look at what happens, and actually something else happens. There is some point where the projected Bellman error is zero: if we look at the Bellman error at that point (those are always the blue arrows), it points straight up, and when we project it down we land exactly at the point we came from. So there is a point where the projected Bellman error is zero, and it's actually a point we already know; I drew too much on top of it, but a bit behind there it says w_TD. It's the TD fixed point: the point that semi-gradient TD converges to if we have linearly independent features and we are on-policy, and the point that LSTD also finds. And that point is, in general, different from the point where the Bellman error is smallest.

So we have some point where the value error is smallest, some point where the Bellman error is smallest, and this point where the projected Bellman error is not just small but actually zero, which is the TD fixed point we already knew. And just to make the picture maximally confusing, we can also ask where the mean TD error is lowest, and that is generally yet another point. None of these four coincide in general. The question, of course, is which one is the right one. Any time we are approximating, we have to accept that maybe we can't find the true value function, because it's not in our subspace. Whenever your space is such that the true value function is in your space, then, what is it, all four of them coincide in one point, and that would be the ideal space; but if we knew beforehand what the ideal space was, then we wouldn't be having this conversation, let's say. So whenever the true value function is not in that space, each of these gives a different solution concept.

There was another question, in the back? Can you go back one slide, maybe one more? Yes, exactly. You're using this distance function on the left, but I'm wondering: the visitation measure is with respect to which policy?

Everything here is on-policy, so it's always the behavior policy.

Okay, but the values you look at are those of the learned policy, right? Because if they weren't, then your distance function would depend on where you are in the space, which seems quite bad.

So, we're looking at the difference between value functions: if you want to know how far this one is from that one, that depends on the visitation frequencies, which depend on the policy. But the policy is always the same, wherever you are in this space: in this whole space only the value function differs. We're not updating the policy, we're only doing policy evaluation, so the policy is always the same, the distance measure always depends on the visitation frequency of that one policy, and what changes is the difference between, for example, this value function and that value function. Does that answer your question?
Thanks. From far back, thank you. I had another question about the visitation frequency: as I understood it, it is the stationary distribution of the Markov chain induced by the behavior policy? If you are looking at the infinite-horizon case, it is the stationary distribution. If you are looking at episodic settings, then you don't really have that concept of a stationary distribution, because you always go back to some starting-state distribution, and that influences it as well. But yes, of course. So suppose, even in the infinite-horizon setting, if you had the transition model you could of course calculate it easily with a matrix operation. In the on-policy setting, is there any pathological case where you can't actually find this distance function because the Markov chain doesn't really mix, so you can't get it from the samples? Or can you always find an estimate that is good enough? I think there are two remarks on that. One: we need this concept of visitation frequency for the analysis, for thinking about what these methods are going to converge to. It is not necessarily the case that the final algorithm you propose also uses this concept of visitation frequency. If I do semi-gradient TD, which is what we are also analyzing here, there is no visitation frequency in the update. But of course, if you want to understand where this converges if we keep following it, then the states, or state-action-state triples, where you execute that update rule are governed by the stationary distribution. That is how it comes in, but it doesn't come in explicitly, so you don't need to actually know it to be able to run the algorithm. That's one. Two: you essentially never need to approximate it; it is a theoretical concept that should exist. And it doesn't always exist either, because, I forget the technical term, you could have an MDP that is for example disconnected: a bunch of states where you can go from one to the other, and another bunch of states where you can go from one to the other, but you cannot go between these two sets. Then the set of states I end up visiting depends on whether I am initialized in the one set or in the other set. So there are MDPs where this concept of a stationary distribution is not well defined, of course, when the Markov chain is not well behaved. Yeah, I was thinking about this update according to the Bellman error: that is not really implementable, because for the projection you would need the visitation frequency. That's right. So, I mean, the question is still whether that is what we want to do, but if you wanted to come up with an algorithm based on following the Bellman error, you probably wouldn't get something that goes exactly in the direction of the Bellman error, but something that goes in the direction of a stochastic approximation of it. Just like when we do tabular TD(0): we don't go exactly in the direction of the Bellman error vector, but in the direction of a stochastic approximation of it, which sometimes goes a bit more in this direction and sometimes a bit more in that direction, but the expected direction of the update is the Bellman error vector. In that sense we could maybe come up with something like that.
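As a side note on the matrix operation mentioned in the question: with a known on-policy transition matrix P and a chain that mixes, one illustrative way to compute the stationary distribution is power iteration; a minimal sketch, with made-up names:

```python
import numpy as np

def stationary_distribution(P, iters=10_000):
    """Stationary distribution of a mixing Markov chain with transition matrix P.

    Repeatedly pushes a uniform start distribution through the chain; for a
    disconnected or otherwise badly behaved chain this need not converge to a
    unique answer, which is exactly the pathological case discussed above.
    """
    d = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        d = d @ P
    return d / d.sum()
```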
So, we were here: we had all these different types of errors that we could try to minimize, and they all sound reasonable. Why not minimize the TD error, for example? But we have to choose, so maybe we should look at them one by one to see whether each one is actually a good error to be minimizing. We can start with the TD error, which I already mentioned before: you could change semi-gradient TD to be a true gradient of the TD error, but is that really something you would want? We can look at a very simple example, again taken from the Sutton and Barto book, to see what kind of problems can exist. Look at this little system with three states A, B and C; the rewards are written on the arrows, and the policy at A is 50-50 between top and bottom. If you think about it with a kind of symmetry argument: from A, I have an equal chance to end up in B or C; from B I always get one; from C I always get zero. So what value, intuitively, do you think would minimize the TD error at A? 0.5, yeah, excellent. Okay, let's fix A to be 0.5 by symmetry, and then we can think about what value we should assign, for example, to B. At B you could assign the value one. Then you have an error of zero on the outgoing transition, because you have a value here of one and the TD target will also be one, so one minus one is zero. But you have an error on the incoming transition: on the transition from A to B, I come from 0.5 and I go to one, so I have an error of 0.5, or a squared error of a quarter. I can probably do a little bit better: I am squaring this error and I am squaring that error, and if I shift B a little bit downwards, this one goes up a little and that one goes down a little. Because I am squaring, I would like B to be exactly in the middle, so I would like B to be three quarters, and by the same argument I would like C to be one quarter; then the squared TD errors of all four transitions together are minimal. And maybe that is not what we want. If we just look at B, it doesn't matter how we got there; from B we are always going to get a reward of one, so the only value we would really be happy to assign to B is one. We don't want to assign three quarters to B, because then we are adjusting the value of B downwards because of something that happened in the past, and that seems counterintuitive. For that reason we say: minimizing the mean squared TD error is probably not what we really want. And that is also why taking the semi-gradient method, which looks like a gradient of the TD error, and turning it into a true gradient of the TD error is not really what we want to do, because then we would find the solution where B is 0.75.
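A quick numerical check of that argument, under the setup just described (reward 0 on both arrows out of A, reward 1 after B, no discounting) and with V(A) fixed at 0.5; the code is only illustrative:

```python
import numpy as np

# Squared TD errors of the two transitions that involve B:
#   A -> B:       delta = 0 + V(B) - V(A) = V(B) - 0.5
#   B -> terminal: delta = 1 + 0 - V(B)   = 1 - V(B)
vb = np.linspace(0.0, 1.0, 1001)
mean_sq_td_error = 0.5 * ((vb - 0.5) ** 2 + (1.0 - vb) ** 2)
print(vb[np.argmin(mean_sq_td_error)])  # prints 0.75, not the intuitive 1.0
```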
So what we could do instead is minimize the mean squared Bellman error. In that case I could assign one to B, zero to C, and still 0.5 to A, for the same argument as before, and ask: what is the Bellman error at A? Now I have to take the average of the TD error if I take the one transition and the TD error if I take the other. On the transition to B I have a TD error of plus one half, and on the transition to C a TD error of minus one half, so the Bellman error at A is just the average of these two, which is zero. So if I look at the mean squared Bellman error of that particular assignment of values, the Bellman error is zero. That gives me a solution concept I am really happy with: it agrees with the values that I would assign intuitively. Okay, so now we have something we consider really desirable, and we can ask whether we can actually reach it: can we define an algorithm that actually finds those types of solutions? And there we can prove that we cannot, because I can have two different environments that generate exactly the same data, yet they have different solutions for the mean squared Bellman error. So if I have any type of algorithm that takes data and spits out a value function, I cannot put a box there that is right for both of these two environments: the data will be the same, I am going to spit out the same value function, and in at most one of the two environments I am going to be correct; in the other one I am going to be wrong. So minimization of the mean squared Bellman error is not possible from data alone; you need access to the underlying MDP. If you want to see what this counterexample actually is, it is also in the Sutton and Barto book.

All right, so now, for bootstrapping, this projected Bellman error is basically the only one left. We have seen that the TD error is not really desirable and the Bellman error is not achievable, and the minimum value error is not based on TD errors but on Monte Carlo errors, so it is not something we can get through bootstrapping. So the only thing we have left is the minimum of the projected Bellman error, and actually that is starting to look quite appealing: we already know that if semi-gradient TD converges, it converges to that point, so maybe it already makes sense, and we know that with linear approximation the projected Bellman error is zero at that point, which is also nice. So we can already find that point with semi-gradient TD if we are on-policy and have linear function approximation. Now we can ask whether we can also find an algorithm that is a little bit more stable, that is more directly based on trying to find that point, and that also works in the off-policy setting. I don't want to go through all the mathematical derivations because of the time (again, this is all in the Sutton and Barto book), but you can show, through a sequence of steps, that you can define a true gradient algorithm that starts from the loss function of the mean squared projected Bellman error and arrives at this equality for the gradient of that error. For this gradient we get a product of three factors that are all expectations. If we have a single expectation, we can always get a pretty good estimate of it by plugging in samples and averaging. Here we cannot just do that, because the first factor and the last factor both depend on the same random variable, the next state, which is not a deterministic quantity.
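Written out, roughly in the form used in the Sutton and Barto book (importance weights omitted for the on-policy case; x and x' are the feature vectors of the current and next state, delta is the TD error), the gradient in question is the product of three expectations, of which the first and the last depend on the next state:

```latex
\nabla_{w}\,\overline{\mathrm{PBE}}(w)
  \;=\; 2\,\mathbb{E}\!\left[(\gamma x' - x)\,x^{\top}\right]\,
        \mathbb{E}\!\left[x\,x^{\top}\right]^{-1}\,
        \mathbb{E}\!\left[\delta\,x\right]
```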
Maybe for some of you it is intuitive why that means you cannot just plug in samples; maybe it is not for everyone, so I made a small example to illustrate it. Why is it a problem? Consider a random variable y that is uniform over minus one and plus one, and say I want to estimate the term: expectation of y times expectation of y. What is the expected value of y? y is either minus one or plus one, so its expectation is zero, and the expectation of y times the expectation of y is zero. What happens if we take the expression y times y and just plug in samples from the system? With a sample of minus one I get minus one times minus one, which is plus one (it should be an average, sorry, not a sum, but it doesn't matter in this case), and with a sample of plus one I get plus one times plus one. If I average these together I get plus one, which is not the same. That is the first thing to realize: what we are approximating this way is not the expected value of y times the expected value of y; it is the expected value of y squared, which is of course not the same thing. So what we can't do is plug the sampled transitions into this long expression and hope to get an unbiased estimate of the gradient that we are looking for. But we can define another algorithm. One algorithm, also discussed in the book, is to treat the latter part of the expression together as an intermediate vector v and learn v from data, based on a longer sequence of data from the system. Then, whenever a new transition comes in, we take that learned value of v and just plug in the reward, the features of the current state, the features of the next state, and so on, for the first factor. And there is a rather elegant way to approximate that intermediate part: if you look at it, it looks like the formulation of a least-squares problem with inputs x and targets delta, so we can use an incremental implementation of that least-squares problem. So we have an update equation for v and an update equation for the overall w that approximates the value function. We now have two learning rates, and we need some technical conditions on the values of beta and alpha. So what we have now is an algorithm: we said that semi-gradient TD, when on-policy, minimizes this projected Bellman error; now we instead start from the point of view that we have that projected Bellman error and we want to follow its true gradient, and we find, as one of the possible solutions, this particular algorithm for minimizing that error. Because it is a true gradient method it is much more robust, and it also works in the off-policy setting if you add in the required importance weight, which is already there. This algorithm is called GTD2; it is a gradient-TD method, and there are different methods based on the gradient-TD concept. GTD2 is just this particular one, and there are other methods based on the same principle.
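A minimal sketch of one GTD2-style update in the linear case, following the two update equations just described; alpha and beta are the two step sizes, rho is the importance weight (equal to 1 on-policy), and the function and variable names are illustrative:

```python
import numpy as np

def gtd2_step(w, v, x, x_next, r, gamma, rho, alpha, beta):
    """One GTD2 update on a single transition (x, r, x_next) with linear features.

    v is the secondary weight vector, an incremental least-squares estimate of
    E[x x^T]^{-1} E[delta x]; w then follows an unbiased estimate of the
    gradient of the mean squared projected Bellman error (a true gradient step).
    """
    delta = r + gamma * np.dot(x_next, w) - np.dot(x, w)
    x_dot_v = np.dot(x, v)  # use the current v in both updates
    w_new = w + alpha * rho * (x - gamma * x_next) * x_dot_v
    v_new = v + beta * rho * (delta - x_dot_v) * x
    return w_new, v_new
```

The point of the secondary vector v is that the product of the two sample-dependent factors is never formed from a single sample, which is exactly the double-sampling issue from the y example above.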
What GTD2 does: it can be proven to converge to zero mean squared projected Bellman error if you have linear features. If you have non-linear approximation it still converges, which is the good news, but you get a local optimum, which in general is all you can hope for with non-linear approximation anyway. And this is not the be-all and end-all: there are more sophisticated gradient-TD methods, but we won't be able to cover them today. You can also wonder what we now pay for this fancier method. Well, we have to add an extra tuning parameter: rather than just setting the step size alpha for w, we now also need to set the step size for v. So we get yet more tuning variables, as if we didn't already have enough. And of course we need to store and update two parameter vectors, so we get twice the memory requirement and roughly twice the compute requirement of whatever algorithm we had, which is non-trivial but also doesn't seem excessive.

So now we can think about all of this in the bigger picture. There were a lot of details, but can we summarize which of these different algorithms converge, which don't, and what they actually converge to? To capture that overall view I made this table with the different methods we have looked at: semi-gradient TD; LSTD, which we covered very quickly, in one slide, when we were thinking about what semi-gradient TD converges to; and gradient TD, with the GTD2 algorithm we just covered as an example. The first thing to note is that, as we saw in the example, semi-gradient TD does not necessarily converge in the off-policy setting, so I already wrote there: no convergence. With non-linear function approximation, which we didn't really talk about, semi-gradient TD also does not necessarily converge, so in those two cases you can potentially have problems. LSTD is by its nature a linear method, it is least-squares TD, so we cannot do anything if we are not linear, and I can write that down as well. Now, what do we have left? About ten cases, and they are actually very simple: everywhere else in the table, the methods converge to the minimum of the projected Bellman error. There are some conditions, and I tried to summarize the most important ones. If we have a gradient method, we need an appropriate step-size schedule: step sizes that decrease over time in a particular way to guarantee convergence to the optimum. For linear function approximation, if your features are linearly independent, there is a single solution, so you always converge to the same point; if your features are not independent, there are multiple optimal solutions, so you converge to one of them. It is always an optimum, but there are multiple optima, and depending on how you initialize, sometimes you might find one and sometimes you might find another, which is probably fine, but it is good to know.
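The "particular way" in which those step sizes have to decrease is the usual stochastic-approximation requirement on the step-size sequence alpha_t (and similarly for beta_t):

```latex
\sum_{t=1}^{\infty} \alpha_t = \infty,
\qquad
\sum_{t=1}^{\infty} \alpha_t^{2} < \infty
```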
In the tabular case, whether on-policy or off-policy, we know that we find the optimal solution: the point where the projected Bellman error is minimal is also the true value function. Everywhere else we are approximating, so the point of minimum projected Bellman error is not necessarily the true value function, because the true value function may not be representable in our subspace. Then we can ask in which cases we have global convergence, meaning we find the global optimum. That is also easy: whenever we have linear approximation we find the global optimum of the projected Bellman error, and whenever we have non-linear function approximation we might find a local optimum. Questions so far? Right, I see we are running out of time, but let me quickly wrap up to make a little bridge to Vincent's lecture after the break. Vincent will talk about DQN, the famous algorithm for learning to play Atari games, and it illustrates some of the challenges of doing control with function approximation, and with non-linear function approximation in particular. I don't want to go into the details here, because that is what Vincent is going to do, but to make the bridge to what we have covered in my lecture: we have looked mostly at linear function approximation, which has nice properties, but you need to hand-design specific features. If you plug in a deep neural network instead, you can feed it very raw information, like the raw pixels of your screen, and you can think of the neural network as learning its own features. Vincent will go into the details of how that is implemented exactly, but what is important for this tutorial is to realize that DQN learns off-policy, so you are in this dangerous territory, with a semi-gradient version of Q-learning. This is, very roughly, the update on which DQN is based, and you see that, just as we have seen today, we have a target that also depends on w but which does not show up in this part of the gradient. What we have learned today is that semi-gradient learning used off-policy is potentially unstable. One of the things they do in DQN to stabilize it is that they do not plug in this w here, but maintain a copy of the parameters that is updated at a slower rate, so you break that dependency a little bit, and this copy of the parameters is used in the target. Now the update actually becomes more like a true gradient method. This is not the only thing they did to make it work; Vincent will discuss it in more detail.
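Very roughly, and only to show the role of that slower-moving copy, here is an illustrative sketch with a linear Q-function standing in for the network; this is not the actual DQN implementation, and all names are made up:

```python
import numpy as np

def q_learning_step(w, w_target, phi_sa, phi_next, r, gamma, alpha):
    """One semi-gradient Q-learning step with a frozen target copy of the weights.

    phi_sa:   features of the current state-action pair
    phi_next: one feature vector per action available in the next state
    The bootstrap target is computed with w_target, not w, so it is treated as
    a constant during the gradient step.
    """
    target = r + gamma * max(np.dot(phi, w_target) for phi in phi_next)
    delta = target - np.dot(phi_sa, w)
    return w + alpha * delta * phi_sa  # gradient of Q(s, a; w) is phi_sa here
```

Syncing w_target to w only every so many steps, or with a slow moving average, is what is meant by a copy that is updated at a slower rate.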
With that, I want to wrap up. We have seen that on-policy control with function approximation is relatively straightforward, but as soon as you go off-policy, whether you do prediction or control, it gets quite tricky quite fast. We have seen that one solution is to use this gradient-TD algorithm, a somewhat more complex algorithm, but one that follows a true gradient, at the cost of some overhead: you have to keep track of more values and you have an additional hyperparameter. And this last point, no, okay, this is actually not correct as written: I wrote down that it only works for linear function approximation, and that is not really true. But it is made mostly for, let's say, relatively small model systems, let's put it like this. With DQN, of course, they are looking at a massive neural network, which I guess is very hard to analyze analytically and to really guarantee convergence for, but they do demonstrate empirically that it just works really well on the Atari games and so on. And for that they use additional, essentially more or less heuristic, mechanisms to keep learning stable. The power of that is that it works really well empirically, and without having to manually design specific features. With that, I guess we should defer questions to the break, since I am a little bit over time, but if there is anything else, yeah, thanks. Any last-minute questions? Okay, so I think we will go on break and come back in 15 minutes. Thank you very much. Okay.