So our second speaker is Vincent François-Lavet. He's an assistant professor at VU Amsterdam. He works in deep reinforcement learning. You may know him from his wonderful manuscript called An Introduction to Deep Reinforcement Learning, and I think he's also interested in generalization in reinforcement learning. He's also the organizer of the last edition of RLSS, and I don't think I can overstate this: after organizing this edition with a dozen people, it's quite impressive that you managed to pull it off more or less alone last year. So thank you again for that. And now the floor is yours.

Great, thanks a lot for the introduction. So it's nice to be here for this next edition of the Reinforcement Learning Summer School, and today I will talk about function approximators, and in particular deep learning, in reinforcement learning. Basically, I will give you the basis of what you need to make deep reinforcement learning work, and this afternoon you will have a tutorial where you can actually put this into practice.

So here is the outline of the talk. The first 10 slides, more or less, will be really basic things, and I will cover them very quickly, not to bore you too much. If you have any questions, just stop me. Then we'll dig into how we can actually use deep learning as a function approximator, and we will also look at different variants. So in this part, we look at deep Q-networks, which you have probably already heard about, and in this part, we look at some variants of these deep Q-networks. Then, some people say that deep reinforcement learning is only used in toy environments, that it's not useful, and those kinds of things, so we'll discuss some real-world examples of deep reinforcement learning. And towards the end, we will discuss some other topics about generalization and some links with upcoming topics like representation learning and model-based reinforcement learning. But the biggest part of this talk, as we will see, is model-free reinforcement learning with deep learning as a function approximator.

So, as you probably all know by now, reinforcement learning has been able to leverage deep learning to improve generalization and to tackle complex problems like games: Atari games, the game of Go, the game of poker, even real-time strategy games that are very complex, with very long time horizons, huge action spaces, and huge state spaces as well. And there is a growing number of real-world applications of reinforcement learning as well, thanks to the use of deep learning. So here, you see what you probably all know: a deep reinforcement learning agent that has been trained from the pixels. That's basically using deep learning as a function approximator, and this is the thing that you absolutely need to have some generalization and to be able to play these games, because then you do not need to have seen exactly the same frame in your training set to generalize to slightly new situations. That's how you can generalize thanks to deep learning.

So basically, learning algorithms in reinforcement learning may include one or more of these components. The first component that a reinforcement learning algorithm can use is the value-based approach, where we estimate value functions that provide a prediction of how good a state or a state-action pair is. In the case of a state, it's the V-value function, and in the case of a state-action pair, it's the Q-value function. So that's the value-based approach.
Another approach that is also model-free is a direct representation of the policy. Usually, we write pi of s to denote a deterministic policy that outputs a given action in the action space, or we write pi of s and a to denote a probability distribution over the action space; in that case we have a stochastic policy that, for a given state, will sample some action from a distribution. These two are model-free approaches, because from experience we can directly act in the environment: when we know the value function, or the Q-value function more in particular, or when we know the policy, we can directly act in the environment.

And then another set of approaches is called model-based approaches, and they usually work with planning. When you know the model of your environment, you have not finished the job: you still need an additional step, which is either planning or learning, in some way, the value function and the policy based on the model. These kinds of more indirect approaches are called model-based, and we'll only talk about them towards the end of the lecture. So we will focus here on value-based approaches, in particular with deep learning as a function approximator. And you see here that you can also combine the different approaches, but again, we will focus for most of this talk on value-based reinforcement learning.

So you already know this. The expected return, V pi of s, takes a state as input and gives a real number as output, and it denotes the expectation of the sum of future discounted rewards: the further you are in the trajectory, the more discounted the rewards are. You start in a given state s and you follow policy pi. So that's the V-value function. The Q-value function is more or less the same, but instead of directly following policy pi in state s, first you take action a, and then you follow policy pi; otherwise, it's still the sum of future discounted rewards, in expectation over the future trajectories. And the good thing about the Q-value function, which has already been recalled by Erker and also in previous talks: from the optimal Q-value function, you can directly derive the optimal policy.

OK, so from this Q-value function, whose definition is the sum of future discounted rewards, you can rewrite it by making the Q-value function itself appear in the definition, and you obtain a recursive definition of the Q-value function. Here, you just take the immediate reward out of the sum: for k equals 0, you have gamma exponent 0, which is 1, so you take the r_t out of it, and then the sum starts from 1 to infinity. And that sum can be rewritten as the Q-value function starting from an action that follows the distribution pi, directly starting from the state at time t plus 1. This recursive definition of the Q-value function is basically at the basis of the Bellman iterations. And you can, in particular, try to fit these Q-value functions not for any policy pi, but for the optimal policy. And basically, that's what you use for the Bellman iterations.
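In symbols, a sketch of the standard notation for what was just described:

$$V^\pi(s) = \mathbb{E}\Big[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\Big|\, s_t = s, \pi\Big], \qquad Q^\pi(s,a) = \mathbb{E}\Big[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\Big|\, s_t = s,\, a_t = a, \pi\Big]$$

Taking the immediate reward out of the sum gives the recursive form, and for the optimal policy this becomes the fixed point targeted by the Bellman iterations:

$$Q^\pi(s,a) = \mathbb{E}\big[r_t + \gamma\, Q^\pi(s_{t+1}, a')\big], \ a' \sim \pi(s_{t+1}, \cdot) \qquad \text{and} \qquad Q^*(s,a) = \mathbb{E}\big[r_t + \gamma \max_{a'} Q^*(s_{t+1}, a')\big]$$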
And we have nice properties in the case of finite MDPs, because by applying this update rule from the Bellman iteration, we converge to the optimal Q-value function, as long as the learning rates are big enough throughout the learning while decreasing fast enough. That's basically what these conditions say: the sum over the different time steps of the learning rates should be infinity, but the sum of the squared learning rates over all the iterations should be less than infinity, it should converge. So the learning rate should still decrease sufficiently fast to satisfy this condition. And the second condition, besides the learning rates, is that your exploration policy should be explorative enough so that you visit every state-action pair in your environment an infinite number of times. So in the finite MDP setting, finite state space and finite action space, we converge. That's great.

But in many cases, we cannot work with a finite MDP. Why is that? Because for large-scale problems, we have what is called the curse of dimensionality. For instance, if you take a robot with about 10 features for the state, and you want to discretize each feature into 100 bins, you already have about 100 to the power 10, so 10 to the power 20 states, which is already becoming prohibitive. If you look at chess or Go, you have even more states. So you have three problems if you want to work in the finite MDP setting: memory problems, compute problems, and, maybe even more importantly, no generalization in the limited-data context, because you really need your agent to visit every state-action pair in your environment, and that's not what you want. Because of these three problems, you need to go beyond these tabular approaches.

So let's move on to function approximators. Any question? I guess, hopefully, I've not told you too much for now and you are still fresh for the next part. OK, great. So function approximators, what are they? Well, a function approximator in general is a function that takes as input some x and gives as output some y, and it is parameterized by some theta belonging to a vector of real numbers of size n_theta. So we can write it like this: it takes as input x, and it also takes as input the parameters theta. One common notation is to put the theta after the semicolon, such that the theta are clearly the parameters of the function approximator and the x is the input.

And so you can have, for instance, linear function approximators, with the purple line here. Even if the true function that you would like to approximate looks like this, with a linear function approximator you will never be able to do better than this. And then, for instance, here in this case, we show different polynomials: when you increase the degree of your polynomial model, you can get better accuracy, but you also have some risk of overfitting. So here's an example in the supervised learning context.

So, the different types of function approximators. You have linear function approximators; they are nice because they are differentiable and have a very simple form, as was explained by Erker in the previous talk, but they are not so flexible, so basically we cannot target very complex problems with them. And then you have a whole bunch of other techniques for function approximation, such as SVMs, tree-based approximators, et cetera. They are more or less flexible, but they have one drawback: they are not differentiable. And basically, neural networks are both flexible and differentiable.
And maybe these two characteristics could actually, somehow, it might be a bit controversial, but could actually be what defines what a neural network is. Usually, you have this kind of deep learning architecture with a feedforward pass and backpropagation and those kinds of things. But basically, as long as it is differentiable and flexible, people might want to call it a neural network nowadays. So it's a bit controversial, but as long as you have these two characteristics, you can do nice things. And what you can do is bring the generalization capabilities of these function approximators into reinforcement learning.

So, gradient descent. I guess you all know that, but just a very quick recall. You have an objective function, G of theta, for instance, and you start with some random initialization of your parameters theta. You then move these parameters in the direction opposite to the gradient of G of theta: you calculate the gradient at each iteration, and then you move in the direction opposite to the gradient with some learning rate alpha k, and you update the parameters theta that way so that you minimize your objective function G of theta. So that's gradient descent. And it requires your objective function to be differentiable, of course, so that you can move with gradient descent towards optimizing your objective function.

OK, so now let's get into actually using deep learning in the context of reinforcement learning. Any questions so far? No? OK. So let's get into deep Q-networks. What do we want to do? We want to approximate these Q-values for any given state and any given action. So here it could be any pi, but in practice, we will aim at learning the optimal Q-value function. And to parameterize these Q-value functions such that they generalize, we use function approximators, which is basically what is shown here: we use parameters theta, the parameters of the neural network that represents the Q-value function. So it takes as input s and a, and depending on the parameters theta, it will output some values.

And in particular, in the Q-network, instead of giving s and a as input, we usually give just the state as input and provide a Q-value for each of the finite actions at the output. You can do this if you have a finite number n_A of actions, and that's what deep Q-networks are based on: they work for kind of any continuous state space, but still a finite number of actions. So that's what we will look at for now. And the good thing about this structure of the Q-network, instead of putting s and a as input, is that from one feedforward pass in your Q-network, you get all the Q-values at the output. So when you want to look at the argmax, for instance, you retrieve them all right away in one pass; you don't need to do a pass through your Q-network for each possible action, which you would need to do if you gave the state and the action as input. So you give the state, and you directly get all the Q-values for all possible actions.

And so what do we do? Well, we basically use the semi-gradient update. We start with parameters theta, and we move in the direction opposite to the gradient of the objective, where the objective is the square of the difference between the current estimate of the Q-value for the state-action pair (s, a) and the target, and the target is the one-step lookahead expected return. That means we look at, for a tuple of state, action, reward, next state, the reward plus gamma times the maximum expected return that you can get from the next state onwards. So it's the one-step lookahead expected return that you can estimate from the information in the tuple of state, action, reward, next state, and that is your target. And basically, what you try to do is get your Q-values close to the target by minimizing this objective with gradient descent via the parameters theta that parameterize your Q-value function.
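In symbols, a sketch of the update just described, with the target held fixed (the gradient is not taken through the target, which is what makes it a semi-gradient):

$$y = r + \gamma \max_{a' \in \mathcal{A}} Q(s', a'; \theta), \qquad \theta \leftarrow \theta - \alpha \, \nabla_\theta \big( Q(s, a; \theta) - y \big)^2$$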
Yes, question? Good question. For the target, you are using the argmax action, and how come the Q-value of the argmax is used to update all the actions in the neural network? OK. So you try to modify the Q-values here only for the given state-action pair, but for the target, you indeed look at the max over all your actions at the next state. You only say explicitly: I want to minimize this. So if I go back to the previous slide: for instance, let's say that it's the second action that you took in a given state s. You will want the value outputted by your Q-network for that action to get close to the target, which depends on the reward and the next state you end up in. So you only try to modify the value for this given state-action pair, but in the target you look at the max over the whole action space. Any other question? And it might be that by modifying this, since you optimize the parameters theta, you also change the values of the other actions, but that's not what you explicitly try to do. And then you also, of course, need to have visited, at least sometimes, or to be able to generalize over, your whole action space; but you get that information from other tuples.

OK, so why do we use the squared loss? You already had a lot of theoretical information about the semi-gradient part from Erker, and he also explained what the derivative of this squared loss gives you and how it converges. But why is the least-squares error useful? This is really a very general question, right? It's not only for reinforcement learning. Why is there a square here? We could maybe have exponent 3, why not? We could have no exponent at all, why not? It's strictly positive? Yes, but it's not only that: if you take exponent 4, you would have the same thing, and that's not what you do. It's convex? Yes, that's definitely an advantage, but again, if you take exponent 4, it would be convex in the same sense as what you say, right? Yes, yes, it's getting closer. The gradient is? Yeah, so it's also due to this.

So basically, the reason is that by minimizing the mean squared error, we converge to the distribution's mean. For instance, if you try to optimize c to fit this expectation for samples y taken from the distribution of the random variable Y, and you minimize this, your c will be equal to the expected value of Y. A simple example: suppose for a given state-action pair (s, a), 25% of the time you get a reward of 1 in the terminal state, and 75% of the time you get a reward of 0 in the terminal state. By minimizing the squared error loss on the samples of this distribution, you will fit Q of (s, a) equals 0.25. And that wouldn't be the case if you took exponent 4 or exponent 3 or anything like this. If you take exponent 4, c will not be 0.25. So basically, exponent 4 penalizes the big errors more than the small ones, so the estimate will want to be closer to the reward of 1, even though that reward only happens 25% of the time, because the large error of 0.75 raised to the power 4 weighs much more than the small error of 0.25 raised to the power 4, so the minimizer moves up. What do you mean exactly? Yes, exactly. So for exponent 4, it will be equal to some other statistic of the random variable Y, something related to higher moments, and not the expected value. And the expected value is what you're interested in: you want your Q-values to be the expected value of the discounted return. And that's why we have the square.
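To make this concrete, here is a small numerical check of the argument, a sketch using numpy with the 25%/75% example from above; the numbers in the comments are what this grid search finds:

```python
import numpy as np

# Returns observed for one state-action pair: reward 1 with probability 0.25, else 0
returns = np.array([1.0] * 25 + [0.0] * 75)
candidates = np.linspace(0.0, 1.0, 10001)  # candidate estimates c

for p in (2, 4):
    # Mean loss |c - y|^p for every candidate estimate c
    losses = (np.abs(candidates[:, None] - returns[None, :]) ** p).mean(axis=1)
    print(f"exponent {p}: minimizer = {candidates[np.argmin(losses)]:.3f}")

# exponent 2: minimizer = 0.250  (the expected return)
# exponent 4: minimizer is about 0.41  (pulled toward the rarer, larger error)
```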
OK, so now let's see how we do this in deep Q-networks. We have this equation, which is basically what I've just explained, but there are many tricks that are important to bring in. Why is that? Because, as Erker mentioned in the previous lecture, as soon as you have a nonlinear function approximator, and even though you use the semi-gradient, which is important in this case to converge to something meaningful, you will have a risk of divergence, or at least instabilities. So you have to bring in some specific tricks to learn as well as you can, even though you are in the complex case of off-policy learning with a nonlinear function approximator. You need to make sure that even in that case, you are able to learn effectively.

So in deep Q-networks, there are two key things. The first one is to use a replay memory. Basically, you keep a large buffer of previous tuples of state, action, reward, and next state; you put them in your replay memory, which allows you to keep a broad range of the information that you acquired in the environment. And then from that, you sample mini-batches, from which you compute the updates of your Q-values based on the rule that we saw earlier. So the first thing is: you keep a large replay memory of your past transitions, and you sample from it to calculate the update.

And the second thing is: you use a target network. So not only do you use this idea of a target that is kind of fixed, in the sense that you use semi-gradients, so you do not update the theta in the target when you minimize this objective, but in deep Q-networks you also actually freeze the target network for a larger number of iterations. This constant C is usually, for instance, 1,000 or 5,000 iterations, of that order. And for that number C of iterations, your target is kept fixed. So it's kind of similar to fitted Q-iteration: before you have ever updated the target theta, for instance if you start with Q-values that are initialized to more or less 0 everywhere, at first you will only fit the immediate rewards. And only when you update the target network after C iterations will you try to fit the immediate reward plus the reward at the next time step, as given by your Q-network that you have updated one time. So that's the idea. But the difference is that you do not wait until you have completely fitted the Q-value function either. So the idea is to use a target network that is kept fixed for a given number of iterations and updated every C iterations. Is it more clear, this idea of a target network?
So basically, you make a copy of your neural network Q that you keep fixed for C iterations and use in your target. That's why we write theta minus here: it clearly emphasizes that this is a copy of a previous neural network that is kept fixed for C iterations. And these two elements were the key for scaling deep Q-networks to the Atari games in the 2015 paper.

Now let's see what it looks like when we apply deep Q-networks in a very classic environment, such as Mountain Car. I guess many of you already know this example: you have a car that needs to go uphill, and it cannot do it in one go, so it needs to go a little bit forward, a little bit backward, and then it gets uphill. And what we can look at is what DQN does when you try to fit the value function. Here we have the two features that form the state, the position and the speed, and on the z-axis you have the V-value function. What you see in the video is the process of learning this V-value function through time. And what you see is that, even though we use a replay memory with mini-batches and a target network, there are already some instabilities in the learning. In the end, it converges to something pretty close to the optimal values. But if these instabilities are worse than what you see here in this learning process, then you will get complete divergence in your learning algorithm, or you can keep some bad instabilities that lead to a policy that is never really good. So you need to be careful about stability. That's why at least these two things, a replay memory with mini-batch sampling and a target network, are important.

And once you have learned this V-value function, so basically here the V-value function is just the max over the action space of the Q-value functions that are actually learned in the DQN algorithm, well, in this small environment, because there are only two continuous features, you can also visualize the solution. You start from position zero and speed zero, and then, by following at every step the action that corresponds to the max over your Q-values, you follow the optimal policy and you get up there like this.

OK, so these were more or less the basic things that were introduced in deep reinforcement learning to make the learning work. There are many other tricks that we will cover today. But any questions so far? Yes. Where do we see the instabilities visually? Yes, so for instance, let's look at, I don't know, maybe position minus one and maximum speed; we can look at this point. Here it's more or less at the final value, where we can estimate that it has converged. But if you look through the learning, you will see that it needs to go towards the correct value, but sometimes it even overshoots, and then goes a little bit too low, and then a little bit too high. So these are the instabilities. It's not only that you need to wait for some iterations to converge; in this process of convergence, you will go a little bit too high, a little bit too low. And these instabilities are propagated through the updates, because if you make an error, for instance, here, then if from some state and a given action you end up here, that error is propagated to those Q-values as well. That is the effect of the instabilities.
And these instabilities also get worse when you have a high discount factor. If you have a discount factor of 0, that's the extreme case: you're only fitting the immediate reward, so basically you cannot have any instabilities; it's just supervised learning. And as you bring this discount factor higher and higher, you have more and more instabilities. Any other question? The value of C? Oh, yeah, so here it's probably 1,000 iterations. Any other question?

Oh, yeah. I know that instead of the maximum operator, you can take an expected value over the next Q-values. Could that make the learning more stable, or how would that influence the learning? So here, you're trying to converge towards the optimal Q-values; that's why you have this max over a. So basically, you take, in a sense, the expectation, because you have your tuples; every time you take a given, well, it's not the right equation, but yeah. But isn't this a larger-variance sampling of that expectation? Couldn't we just make a sum weighted by the policy values over all the possible next states? Oh, OK. So basically, what you try to fit is this part, right? Yes. You try to fit, in expectation, this one-step lookahead when you are in a given state s and you take a given action a. Over all the tuples that start with (s, a), you look at the rewards you obtain and the next states you end up in, and you calculate this expectation. That's what you are trying to fit with these semi-gradient updates, with the Bellman iterations. And you do it in the specific case of the max over the action space because you don't want to converge towards just any policy's Q pi; you want to converge towards Q star. OK. Thank you.

Excuse me, yes? So I had a question. When you have a replay memory and the mini-batch size is, say, 32, and the mini-batch is selected randomly, the samples that are not selected in the mini-batch, are they discarded? No, so thanks for the question. You keep a large replay memory, and basically, every tuple that is in the replay memory ends up several times, at least in expectation, in these updates. At every step, you store one tuple of state, action, reward, next state in your replay memory, and at every step you also do an update based on a mini-batch taken from the replay memory. So for the vast majority of the tuples stored in the replay memory, they will be sampled several times for these updates. They are reused several times, and that's also an important part of the sample efficiency: you reuse every tuple that you have observed several times.

Other questions? And in the replay memory, do you keep all data always, even if the training is very long, or do you discard some data over time? So usually, the replay memory will have a finite size, maybe 1 million tuples, and you will use something like FIFO, first in, first out. So basically, the oldest tuples are discarded first. Any other question? So you will also have the opportunity to actually implement exactly this deep Q-network during the practical session, and I think you will also have a recap of all these aspects during the afternoon session.
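To tie the pieces together, here is a minimal sketch of a DQN training step in PyTorch, with a FIFO replay memory and a frozen target network; the network sizes and hyperparameters are illustrative, not the exact settings of the 2015 paper:

```python
import random
from collections import deque

import torch
import torch.nn as nn

n_features, n_actions = 4, 2  # illustrative sizes
q_net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())  # theta_minus starts as a copy of theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

replay = deque(maxlen=1_000_000)  # FIFO: the oldest tuples are discarded first
gamma, batch_size, C = 0.99, 32, 1000

def train_step(step):
    if len(replay) < batch_size:
        return
    # Each stored tuple: (state features, action index, reward, next state, done flag)
    s, a, r, s2, done = zip(*random.sample(replay, batch_size))
    s = torch.tensor(s, dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64)
    r = torch.tensor(r, dtype=torch.float32)
    s2 = torch.tensor(s2, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)

    # One forward pass gives the Q-values of all actions; pick Q(s, a; theta)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # Target y = r + gamma * max_a' Q(s', a'; theta_minus), frozen (semi-gradient)
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * target_net(s2).max(dim=1).values

    loss = ((q_sa - y) ** 2).mean()  # squared error between estimate and target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % C == 0:  # copy theta into the frozen target network every C steps
        target_net.load_state_dict(q_net.state_dict())
```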
And in the second part, let's say, of this presentation, we will look at additional things you can do to improve deep Q-networks further. So that's what we will look at now. And first, we start with the right deep learning architecture. There is a whole set of potential modules that you can combine to make up a neural network; this is true for any field of machine learning, like supervised and unsupervised learning, and it's also true for reinforcement learning. So let's just look at a few of them.

For instance, when you work from images, you will usually work with convolutional layers, because they are really good at introducing the right inductive bias and learning efficiently. So you will usually combine them, and that's basically what the DQN paper did: they use convolutional layers and then fully connected layers to learn the Q-values.

If you have a POMDP, so for instance you have hidden dynamics, and at every time step you only get an observation of the state, and based on these observations you need to take some actions and you get some rewards, et cetera, but the true states are hidden, then in that case you need to take into account a whole history of observations, actions, and rewards as a pseudo-state. So basically, it becomes a time series, and usually, recurrent neural networks work well; but you can also use CNNs, even though there are some specific advantages to RNNs.

Then you have the transformer architecture. If you have heard about, and I guess you all have heard about, systems like ChatGPT, or many state-of-the-art algorithms, they use transformer architectures. And in reinforcement learning, it has also been used. In particular, in this Gato paper, they learn policies kind of separately for many different domains, and then they use one big transformer architecture to learn tasks that are from different domains: images and question answering, images and proprioception for robotics tasks that are initially learned with robots, Atari games that are initially learned with reinforcement learning, and also text-based games. All of these are then kind of distilled into one big transformer architecture that tells you what action to take for any of these potential inputs, and the goal is to create something that is more of a generalist agent. A transformer is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data. But I will not go into the details of what a CNN is, what an RNN is, what transformers are, et cetera.

Any questions? Yes. Doesn't that kind of... So the idea with deep Q-networks, and with Q-learning in general, is that it is off-policy learning. The goal is to be able to learn from any off-policy data, so in that sense, ideally you would be able to learn from more or less any distribution, as long as it covers your whole state-action space. But indeed, the kind of information you keep in your replay memory can have an impact on how efficient your updates are. For instance, if in your replay memory you only keep, I don't know, the very last transitions obtained with a very bad policy that is not very explorative, then you will not be able to do nice Q-updates either.
But fundamentally, I think the most important thing is that the goal with Q-learning is to be able to learn from any off-policy trajectory. So even if, one million steps before, your policy was very different from the greedy policy currently given by your Q-network, you are still able to make use of these tuples.

Yes. The importance weights? Yeah, so here we use the semi-gradient updates with the Q-learning rule. So it's indeed not the full gradient that you could have, but the semi-gradient. Right? Maybe I can add something about which state-action pair you're updating. There are two actions involved: you are updating a certain state-action pair, so you have that action, let's say the a at time step t, and you also have the action at time step t plus 1, which is in the target, right? Now, what you would do with importance weighting is not correct for the visitation frequency, so not for the distribution of which states or which state-action pairs you're updating; you would only be correcting for the action in the target. And in Q-learning, the action that you plug into the target doesn't come from the behavior policy like it does in SARSA: it's already the maximizing action, which is also your target action. So that's why you don't have to correct.

Yes. For Q-learning or DQN, can you make it multi-step, or use some kind of TD-lambda scheme? Yes, and we will look at this in the coming slides. Oh, okay. Yeah, so basically here we looked at different deep learning architectures, and here's what we cover in the coming slides, and one of these is multi-step learning. Yeah. Any other question?

So, has there been any investigation of the distribution that you use to sample the mini-batches? Like, if you do it uniformly versus, I don't know, giving more weight to the recent samples or something? Yes, there has been some work on that. One of these is prioritized experience replay, which I actually do not cover today for lack of time, but you can indeed try to find smart ways of sampling the tuples for the mini-batch so that you make the learning more efficient: either by using more of the recent tuples, or by looking at the ones that have the biggest Bellman error and sampling these more often. But when you do this, you must also be careful, because there might be some drawbacks; still, this is definitely useful in some settings.

Yes. Well, yes, you can try to keep only the most informative tuples in your replay memory: if you get a tuple that is already somewhere in your replay memory, you could try not to put it in again. So there could be some ways to do that, but this will not completely solve, or not directly aim at reducing, the instabilities. It would aim, I mean, somewhere it could potentially help as well, but it would aim at getting a better coverage of your whole state-action space of interest. So basically, trying to get the most informative tuples in the replay memory, whether by exploration or by more carefully selecting how you sample from this replay memory, can be useful. Okay, so now we will look at these different things.
So these are all different techniques that have been used successfully to improve the learning on top of deep Q-networks. Basically, we are still, for now, in the model-free, value-based, Q-learning-based world, except for distributional DQN, which can't fully be called Q-learning-based, but we'll see that in a moment.

So, double DQN. The goal of double DQN is to decouple two different things. When we look at the usual DQN update, we select the best action and we take the estimate of that best action from the same target network, right? But you could, and that's what double DQN does, select the best action based on your current Q-network, but then evaluate that best action using the value given by the frozen target network. So you decouple the selection of the best action and the estimation of the Q-value at that best action. And by decoupling these two things, you can improve performance further as compared to DQN: you reduce the overestimation bias that appears in Q-learning, and you reduce the instabilities and the risk of divergence. So this is called double DQN, and it has been applied successfully on top of DQN, okay? Better stability and improved learning when you use double DQN; that's one way to try to improve the DQN algorithm further. Is it clear?

So, multi-step learning; that was the question from five minutes ago. In the usual DQN, we use the one-step lookahead target: for a given state-action pair, we look at the reward we get and where we end up at the next state, and we look at the maximum expected return that we can get from that next state. But why couldn't we look at multiple rewards that we get in a given trajectory, and then cut the trace only after multiple steps? Well, actually we can; that's called multi-step learning. This is only one version, called the n-step target. You can also have TD-lambda, which basically gives less and less importance to the n-step targets for bigger n. And when you do this, you can get information about future rewards much faster. But the thing is that you require on-policy data to converge without bias, because this is not an off-policy update anymore that converges without bias. And one way to get the best of both worlds is to use n-step targets with n relatively big at the beginning of the learning, and then decrease n through learning. You get better stability and faster propagation of information into your Q-values at first, by having a big n, and then, by decreasing n afterwards, you reduce the bias due to the fact that the data is not always obtained via the current greedy policy, but usually with some epsilon-greedy policy. A sketch of both variants follows below.
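Continuing the hypothetical q_net / target_net setup from the earlier snippet, here is a sketch of how these two variants change only the way the target is computed; the names and shapes are assumptions, and the n-step version assumes, for simplicity, that no episode terminates before step n:

```python
import torch

# Double DQN target: select the argmax with the current q_net,
# but evaluate that action with the frozen target network
with torch.no_grad():
    a_star = q_net(s2).argmax(dim=1, keepdim=True)
    y_double = r + gamma * (1.0 - done) * target_net(s2).gather(1, a_star).squeeze(1)

# n-step target: y = r_t + gamma r_{t+1} + ... + gamma^(n-1) r_{t+n-1}
#                    + gamma^n * max_a' Q(s_{t+n}, a'; theta_minus)
def n_step_target(rewards, s_n, done_n, n):
    # rewards: [batch, n] consecutive rewards sampled from the replayed trajectory;
    # s_n, done_n: the state and termination flag at step t+n
    y = sum(gamma ** k * rewards[:, k] for k in range(n))
    with torch.no_grad():
        y = y + gamma ** n * (1.0 - done_n) * target_net(s_n).max(dim=1).values
    return y
```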
Okay, so, yes, the question. Just coming back to the importance sampling question: here you would need some importance sampling, because you take multiple actions with a different policy? Yes, so that would be another way to fix this bias: you reweight the rewards that you get, so that, based on the behavior policy, you can still fit a target that is different from the behavior policy. So that would be another way to do it, and keep n-step learning with reweighting. Yes. Is it less common to do it, you think, with importance sampling? Yeah, the thing is that importance sampling also has some problems, because it has high variance. If your behavior policy and your target policy are very different, the weights that you add can become very big or very small, and you get a big variance in your target estimate. So your target estimate might be unbiased, but sometimes it will be very low, sometimes very high, and you try to fit the expectation; but learning from a high-variance target is more difficult than having the mean given to you directly, let's say for free. So it's one possibility, but it brings other drawbacks. Right, thanks.

Other questions? Yes, so instead of sampling state, action, reward, next state, you would sample basically this, but for n steps: state, action, reward, next state, then the action at the next state, et cetera, up to n steps in the future, and you sample that from your replay memory and use it to calculate your target. Yes, so within a given trajectory, instead of only picking state, action, reward, next state, you pick a broader part of the trajectory. What do you mean by different features? Oh, theta. So theta are the parameters of the estimates of the Q-values, and for the n-th state in your small part of the trajectory, you estimate the Q-values, and you only use the actual rewards up to the n-th state. So, I'm not sure I understand. So here the parameters theta do not depend on what n is, right? Basically, the Q-network stays the same; it's just the target that is different, based on the fact that you sample more than state, action, reward, next state. The theta are the same. That's also why you can start with an n-step target with n quite high, let's say 10, and reduce n to maybe even 1 when you want to try to converge without bias, and you keep the same Q-network; it's just the target computation that is different.

Okay, so another thing that you can use to improve convergence is the discount factor. I've already mentioned that a high discount factor makes it more difficult to converge. Usually, you want it to be as high as possible, right? For instance, in Atari games, if you care about the sum of rewards until you die in the game, you would actually want your discount factor to be even 1. But when you try to learn with a very high discount factor, it is very difficult to make deep Q-networks work, and that's why you use a discount factor smaller than 1. And what you can do is start with a lower discount factor and increase it through learning.

This is also similar to how animals and humans may learn. There are empirical studies of cognitive mechanisms saying that only children starting from age three or four are able to maximize rewards that come 10 or 15 minutes later; before three or four years old, children tend to just maximize their very immediate rewards, a very myopic kind of policy. Only when you get older and have gathered more information about the world are you able to optimize your objective over a longer horizon. And so this is slightly similar in deep Q-networks.
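A minimal sketch of such a discount-factor schedule; the exponential form and the rate constant here are just one hypothetical choice, not the schedule from any specific paper:

```python
import math

def gamma_schedule(step, gamma_start=0.95, gamma_end=0.99, tau=50_000):
    # Increase gamma from gamma_start toward gamma_end as training progresses:
    # the gap (gamma_end - gamma) decays exponentially with time constant tau
    return gamma_end - (gamma_end - gamma_start) * math.exp(-step / tau)
```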
If you directly set your discount factor at, for instance, 0.99, learning in this game of Qbert is relatively slow, and you also see some instabilities here in the V-value function. Here, the V-value function is calculated as the mean estimated V-value over the trajectories you go through. And you see that it actually overshoots and then comes back down towards something that looks more or less like convergence, but there is still a lot happening, because the score is still increasing. It looks like the expected return here should be better than there, at least from the Q-values that are learned, but in reality, this is not the case. So basically, you have a lot of instabilities, a lot of overshooting: you try to learn from relatively little data with a high discount factor, and things do not go so well. If instead you use a discount factor that starts at a relatively low value, for instance 0.95, and you increase it to about 0.99, the learning is much quicker, and the learning of the value function is much more stable as well. So that's one more thing you can play with: the discount factor at training time. At test time, you usually still have a non-discounted return: the score written here is just the sum of rewards over your episode until you die, non-discounted. But at training time, you use a discount factor, and you can play with the scheduling of this discount factor, okay?

So, still another trick that you may be interested in is dueling networks. The idea is not to learn the Q-values directly, but to learn two different objects of interest within your Q-network: the advantage function here and the V-value function here, where the advantage function is defined as the difference between the Q-value function and the V-value function. By having some parameters theta that are shared between your value component and your advantage component, and by afterwards merging your V-value estimates and your advantage estimates into the Q-value estimates, you can enforce that within your learning you learn both the V-value function and the advantage function. So that's called a dueling network. It gives you slightly more information directly available in your network, and in some contexts it can also improve learning; a sketch of such a two-head architecture follows below.
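A minimal sketch of a dueling head in PyTorch; the sizes are illustrative, and the mean-subtraction of the advantages is the usual trick to make the V and A components identifiable when they are merged into Q:

```python
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, n_features, n_actions, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # V(s): one number per state
        self.advantage = nn.Linear(hidden, n_actions)  # A(s, a): one number per action

    def forward(self, s):
        h = self.shared(s)
        v, adv = self.value(h), self.advantage(h)
        # Merge the two heads: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
        return v + adv - adv.mean(dim=1, keepdim=True)
```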
Okay, yes, question. So, this light blue line is the V-value function averaged over the states that are visited in the test trajectories. So basically, while your score is this one, your discounted expected return was estimated to be this one, right? You are clearly overestimating your true return, even though you are discounting it. You take your rollouts at test time, where you calculate the score, which is non-discounted, you just sum the rewards, and at the same time you look at the Q-values estimated by your Q-network, and you average the V-values, so basically the max of these Q-values, over your trajectories at test time. So there's an improvement? Yeah, indeed, you see two things: you see that the score increases faster, and you also see that the expected returns estimated via the Q-network are lower. And basically, it seems that they are much more in line with what you expect, because here you clearly have some overshooting of the estimates, while here it seems that it is actually learning something very close to the true optimal Q-values for that given discount factor. So for a given discount factor, for instance gamma equal to 0.97, you get this estimate of the expected return given by your Q-network, and this seems much more in line with what you actually expect, without overestimation. Basically, this leads to faster learning. And in the end, you are still interested in having a high discount factor, because if you keep a low discount factor like 0.95, for instance, you have a high bias in your learning: you're optimizing something that you do not actually care about, optimizing the immediate rewards while there are maybe policies that are better over a longer horizon.

So another thing that is important is to fight overfitting and lack of plasticity in deep learning. The problem starts from the fact that in your replay memory, you sample the same tuples many times. Because of that, there is a risk of, first of all, forgetting the past: everything that is no longer in your replay memory, you risk forgetting, because you do not sample it anymore and you overfit on what is currently in your replay memory. And the second risk is a lack of plasticity. It's been shown that when you learn, for instance, different tasks one after the other, at some point your neural network is not able to learn anymore: the parameters end up taking values in a part of the manifold where it gets difficult for the gradients and the Q-updates to work correctly. So you have a lack of plasticity as well.

And some solutions to fight this catastrophic forgetting and this lack of plasticity are resetting some layers to a random initialization, which feels like a really brutal approach, right, because you're erasing completely the information you have been learning, for at least some of the layers, but it can actually help; and you can also try some specific activation functions to deal with this. You can also play with all the hyperparameters of your deep Q-network: you can play with the replay ratio, or try to get a bigger replay memory, or something like this. But these two are also two solutions.

Yes, question. Is this problem of overfitting, and then not changing the weights anymore, typical for offline situations, or do on-policy reinforcement learning algorithms face the same issue? This is, yeah, most of this would also occur if you use something else than Q-learning. Basically, it's related to the fact that you have a limited data set that you currently keep in your replay memory, and because of that, if you re-sample too much from it, you will only fit that. And when you actually overfit in deep Q-networks, you also increase the potential problems of not converging at all: divergence and overfitting are related. And then the lack of plasticity: if you learn too much on some specific data and then you need to change the target for new tasks, et cetera, it becomes difficult. Thank you.

I just wanted to ask, why is overfitting bad in reinforcement learning? Because in supervised learning, we normally overfit to the training data set.
Yes, so you can somehow overfit, right? But if you overfit too much, even in supervised learning, your actual generalization on a test set that is different from the training set will get worse. So this problem of overfitting is definitely also present in supervised learning. And in reinforcement learning, in addition to overfitting and maybe a lack of generalization, you might also have problems of convergence, because if you are not able to accurately estimate the max of the Q-values at the next state, then you can also propagate errors due to this overfitting.

Other questions, yes? So my question is regarding resetting the layers. You're saying it helps learn stably, but when you're resetting the weights of your layers, are you not going to see a drop in performance immediately, and wouldn't that hurt whatever application you want to use it for? Yes, yes. Indeed, when you reset the weights, at that point it gets worse, but then you keep training, and you get something that has the potential to become even better than if you had not reset in the first place. But indeed, at the time you reset, it will most likely lead to something worse right after; then you keep training. Yes. And do these resets also affect the target network? Usually, these resets of the weights will propagate into the target network at some point, right? You can keep the target network the same at first, and reset only the one that you update, but at some point the effect is also translated into the target network.

Yeah, so the idea is to make sure that you do not get activations that are stuck: for instance, with the ReLU activation function, you can easily get activations that are zero, and through training you may get something that is sparser and sparser. So at least I can give you a counterexample that is easy to understand: ReLU is not always ideal in reinforcement learning, specifically in this kind of more continual-learning scenario. So you can try other activation functions that have better properties, or some regularization of the weights; it's still very much an open research area.

Yes? Yes, so basically, the idea is that at some point, something more or less bad happens in the learning, and then by resetting, you give this plasticity, this ability to learn new things, back to your neural network. You reset it to initial weights from which it is nice to learn new things. You can think of it as: you have really fitted everything, and then maybe the gradients cannot flow anymore through the weights, because it has somehow converged to a local minimum, and every new thing that you try to make it learn will fail. Then you reset the weights so that you start from a new function approximator from which it is easier to learn. Yeah, you change your theta, the parameters of your neural network, at least for some part of the vector theta, and that brings you to another place in your space of function approximators, so the outputs of your function approximator for a given state-action pair will be different, and then you will learn from that.

Yes? Would the same logic apply to the discount factor, reducing it again for a while? Yeah, that might help.
That's definitely something to try. The thing is, reducing the discount factor again will maybe not affect the plasticity, the ability of your neural network to learn new things, but it might be that somewhere in the optimization process it helps. At least, starting from a lower discount factor, because you have less data and more risk of instabilities at the very beginning of the learning, et cetera, and then increasing the discount factor through learning so that in the end you decrease the bias, that is relatively logical from a machine learning point of view, and even from the motivation from neuroscience. Reducing the discount factor and then increasing it again might have other motivations, but yeah, it's something to try.

Another question. You said that the problem is overfitting to the replay buffer; what if we rather change the data in the replay buffer, maybe that's more promising? Yeah, indeed, so another way to tackle these two problems is, for instance, to reduce the replay ratio. Basically, you gather more data and learn less, let's say, from any data that comes into the replay buffer, but then you become less sample efficient: you need more tuples from the environment to learn. But indeed, one reason why we have these two problems, it's not the only one, is that we replay the same tuples several times, and if we replay them too many times, we make these problems worse. Yeah, and if you have a better exploration policy to gather data for your replay memory, that definitely helps. Yeah, sure. That's indeed something I do not cover here; I think there will be talks on exploration as well. But indeed, you can only learn from the data that gets into the replay buffer. And deep Q-networks, that's actually something I might not have mentioned, use an epsilon-greedy policy: a fraction epsilon of the time, you take a random action, and a fraction one minus epsilon of the time, you follow what is currently thought to be the best policy. But if you have something smarter than this, then maybe you collect more interesting tuples and the learning is better.

Okay, so let's see. The official time is over, but I guess I have about 15 minutes left. Okay. So another technique, beyond Q-learning, is distributional DQN. The Q-value function gives, for one state-action pair, the expected return: it gives you one number for a given state and a given action. But you lose some information that way; it does not give you information about the full distribution of returns. When you are in a given state, take a given action, and then follow a given policy, the expectation gives you only one number, and it does not cover the full distribution of the returns. And there are techniques that actually aim at learning this full distribution of returns directly. Basically, you get something similar to the Bellman iterations, but of a distributional nature: you look at the full random variable of the reward and the full random variable of the return from the next time step onwards. And from that, you can do the updates.
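As a rough sketch of one concrete way to do this, the categorical, C51-style variant: the return distribution is represented by probabilities over a fixed set of atoms z, the Bellman backup shifts and shrinks those atoms, and the result is projected back onto the support. All the names here are illustrative, and the next-state distribution is assumed to be computed elsewhere:

```python
import torch

def categorical_target(r, done, next_probs, z, gamma):
    # z: support atoms, shape [n_atoms]
    # next_probs: distribution at the greedy next action, shape [batch, n_atoms]
    n_atoms = z.shape[0]
    v_min, v_max = z[0].item(), z[-1].item()
    dz = (v_max - v_min) / (n_atoms - 1)

    # Distributional Bellman backup of each atom, clipped to the support
    tz = (r.unsqueeze(1) + gamma * (1.0 - done).unsqueeze(1) * z).clamp(v_min, v_max)

    # Project every backed-up atom onto its two nearest support atoms
    b = (tz - v_min) / dz                  # fractional atom index
    lo = b.floor().long()
    hi = (lo + 1).clamp(max=n_atoms - 1)
    w_hi = b - lo.float()                  # interpolation weight toward the upper atom
    target = torch.zeros_like(next_probs)
    target.scatter_add_(1, lo, next_probs * (1.0 - w_hi))
    target.scatter_add_(1, hi, next_probs * w_hi)
    return target  # train with a cross-entropy loss against the predicted distribution
```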
And basically, if you do this, there are maybe two nice things with this distributional learning. The first is that you can have risk-aware behavior. If what you actually care about is, for instance, never having a return lower than zero, then, by looking at the full distribution of returns, you can select a policy that is maybe not the best in expectation, but one that makes sure you never take a policy that gets a very bad return even 1% of the time. So you can implement some risk-aware behaviors; that's one thing you get from distributional DQN. And the other thing you get from it, in many cases, is more performant learning, because you try to fit the whole distribution of returns, and some problems become slightly less problematic than in DQN, let's say it that way. You can also relate this to the effect of auxiliary tasks: you learn something more informative, the full distribution of returns instead of only the expectation, and by learning something more informative, you may also help the learning. In practice, it works well.

OK. And so, some state-of-the-art results from the past years. In 2015, there were these deep Q-networks, and here it's on the 100K Atari benchmark. You see that you needed more than six million steps, because the axis is in multiples of 100K, to reach a metric that corresponds to human level with DQN. And nowadays, with many of the tricks that I've discussed today, this paper managed to get to about 100K steps needed for human level. The ones in blue are actually combining model-based and model-free. If time permits, I will say a very brief word about model-based approaches, but you will get more information about model-based techniques in other talks.

Maybe some real-world examples. You have this example where they use deep reinforcement learning to make a stratospheric balloon navigate. For instance, here you see that at a given altitude, the wind tends to go in one direction, and at another altitude, the wind tends to go in another direction. So depending on where you want to go, you can move higher in the sky so that you travel in the direction you want, and then go down more or less slowly or quickly, and that way you can decide the trajectory of your stratospheric balloon. This problem has been tackled with reinforcement learning.

Another type of problem that can be tackled with reinforcement learning is the optimization of the control of microgrids. Let's say you have a microgrid connected to the grid, with some solar panels, some loads, and some storage systems. By learning from trial and error, your deep reinforcement learning agent can, for instance, when you have short-term storage with batteries and long-term storage with hydrogen, learn to make the best use of these storage devices. Here, you can use convolutions to treat the time series; recurrent networks are also an option. And basically, what you see is that, as compared to a naive policy, obtained here for a given sizing of your microgrid, with the deep reinforcement learning solutions you get better results, and you get close to the optimal, which is what you could achieve if you actually knew everything about the future.
Here, the green line, compared to the two other ones, is when you provide, in addition to the past observations, some solar predictions to your neural network. The neural network, without knowing that this is a solar prediction, can make use of that information to estimate the expected return, and from this estimation of the expected return derive a better policy. So the more information you give, as long as the features are really informative, the better your reinforcement learning algorithm can learn.

Okay, any questions on these real-world examples? No? Then we have some remaining slides that I will try to cover relatively quickly about further improving generalization in deep reinforcement learning. So far we have mainly covered model-free deep RL. What you usually want in reinforcement learning is to improve generalization, that is, to learn from as few samples as possible. What you can do for that is, first, look at an abstract representation that discards non-essential features, either by feature engineering or by automatic methods that do abstract representation learning. You can also modify the objective function: we talked about tuning the discount factor during training, and you can also use some kind of reward shaping. To simplify a lot what reward shaping is: if you are able to provide faster feedback about what the right action is, so providing the reward to your agent as soon as possible, learning becomes easier (a small shaping sketch follows below). Then there is the learning algorithm itself: we looked a bit at the type of function approximator, and we looked mainly at model-free methods, but in some contexts model-based methods can also be very interesting. And, as was mentioned in one of the questions, whenever you can improve the data set, that is, the way you explore your environment, that is something you should always think about. It is not only about how you learn but also about what you learn from; the data set is obviously also very important.

So now I will talk a little bit about combining model-based and model-free, and about the importance of abstract representations. In cognitive science, there is a dichotomy between two modes of thought: animals and humans are described as thinking in, let's say, two different ways. There is a system one that is fast and instinctive, and that relates more to value-based and policy-based learning, because from the Q-values or from the policy you can directly act in the environment. It is very fast and does not require a lot of thinking. And then there is a system two, which is slower and more logical, where you look ahead: if I do this, this is what is going to happen, and then I will be able to do that. Or when you do proofs in mathematics, you need to look ahead to see whether you arrive at the objective you want; it is not fully instinctive. So there is a dichotomy, and in reinforcement learning we can relate this dichotomy described in cognitive science to the model-free and model-based approaches. Model-based methods are usually used together with planning, and this can be related to the system two approach. Then we have the choice of the learning algorithm and function approximator, but we have more or less covered this already.
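Before moving on, here is a minimal sketch of potential-based reward shaping, the well-known variant of the shaping idea mentioned a moment ago that is guaranteed to leave the optimal policy unchanged (Ng et al., 1999). The potential function and the goal position are made-up assumptions for a navigation-style task:

    # Potential-based reward shaping: add F(s, s') = gamma * phi(s') - phi(s)
    # to the environment reward. This gives denser, faster feedback while
    # provably preserving the optimal policy. The potential phi below
    # (negative distance to a goal) is a hypothetical example.
    import math

    GAMMA = 0.99
    GOAL = (5.0, 5.0)

    def phi(state):
        # Higher potential the closer we are to the goal (illustrative choice).
        x, y = state
        return -math.dist((x, y), GOAL)

    def shaped_reward(reward, state, next_state):
        return reward + GAMMA * phi(next_state) - phi(state)

    print(shaped_reward(0.0, (0.0, 0.0), (1.0, 1.0)))  # positive: we moved closer

The shaped reward gives an immediate positive signal whenever the agent moves toward the goal, even before any true reward has been observed.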
So maybe one last thing before the questions: the importance of abstract representations. In deep learning, you have the structure of an encoder that reduces the complexity of the input into a latent, or abstract, representation. One technique to learn this representation is the autoencoder framework, where you try to rebuild the input at the output and then use the latent representation as an abstract representation. That is one way to do it. In reinforcement learning, autoencoders have been used with some success, but there are other techniques that really make use of the dynamics of the environment to shape the abstract representation beyond pixel similarity, because autoencoder techniques really look at pixel similarity. For instance, if the background of your environment changes, an autoencoder will probably treat that as the most important information. With other kinds of techniques, you can instead get results like these. Here, in a maze where your agent can go up, down, left, and right, you can get an abstract representation in only two dimensions, so basically two neurons for your abstract representation, that is much more natural, where you can really see the meaning of the action up. Well, here there is actually some rotation, so the action is not exactly up, but you get the idea: you can really see the meaning of all the different actions. So abstract representations in reinforcement learning can really make use of the dynamics of the environment to shape the representation as well as possible. And for instance, in this Catcher environment, you can get something like this, where you recover the three features of importance, the x, y position of the ball and the x position of the paddle, in the abstract representation as well.

And then you can also do planning. Ideally, you would do the planning in the latent representation space, so that planning is efficient and generalizes well. Basically, you start from x at time zero, which is your abstract representation after the encoder; then you expand the most likely interesting actions, based on your Q-network for instance, up to a given planning depth D; and then you back up this information, making use of both the planning and the estimates of the Q-values, into your current state. With this, you combine the information taken from a model with the information taken from your Q-values.

Okay, so I guess now we can go to the conclusions and potentially questions. We looked at Q-learning in the tabular case, Q-learning with deep learning and the deep Q-networks algorithm, different variants of model-free deep RL, some real-world examples, and then some links with the other talks this week about model-based methods and representation learning. If there are any questions, I am happy to take them.

Yeah, so basically here you can see that, for instance, all the blue crosses correspond to states where the agent is on the left. And you can see that always going in a given direction will always increase, for instance, one of the features. So you can really understand what a given feature means and what a given action means.
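To spell out the latent planning procedure from a moment ago, here is a minimal sketch of such a depth-limited backup. The transition, reward, and Q functions are toy stand-ins for learned networks (all names and numbers are illustrative), and in practice you would expand only the most promising actions according to the Q-network rather than all of them:

    # Depth-limited planning in a latent space: expand actions with a learned
    # transition model, accumulate discounted rewards along the way, and
    # bootstrap with the Q-network estimates at the leaves.
    N_ACTIONS = 4
    GAMMA = 0.95

    def transition(x, a):
        return x + 0.1 * a        # stand-in for a learned latent dynamics model

    def reward(x, a):
        return -abs(x - 1.0)      # stand-in for a learned reward model

    def q_values(x):
        return [-abs(x - 1.0)] * N_ACTIONS  # stand-in Q-network estimates

    def plan(x, depth):
        # Back up model rollouts, falling back on Q-values at depth zero.
        if depth == 0:
            return max(q_values(x))
        best = float("-inf")
        for a in range(N_ACTIONS):
            x_next = transition(x, a)
            best = max(best, reward(x, a) + GAMMA * plan(x_next, depth - 1))
        return best

    print(plan(0.0, depth=3))  # value estimate after a depth-3 lookahead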
Yes? So, a question about learning representations: what do you think about pre-training the representations versus learning them together with the reinforcement learning part? What are the pros and cons?

Yeah, so learning with autoencoders can be very stable, and it is well understood, et cetera, but the drawback is that it is based on the pixel values. With other techniques, you can really forget about what takes up most space in pixel space and only look at what is important for the dynamics, and even for the rewards you are trying to optimize; it can even be a task-specific representation. You cannot have this if you pre-train based only on the states you obtain in your environment without looking at the dynamics itself; then you lose some information that might be useful for the abstract representations.

Okay, so if there are no more questions, I think we can stop here and go to lunch. Thank you very much, Vincent.