And for our second lecture, we will transition smoothly to reinforcement learning and see a range of reinforcement learning algorithms. For these, I'm happy to introduce our next lecturer, Olivier Pietquin. Olivier is a research lead at Google DeepMind, where he leads a team working on reinforcement learning but also other topics like imitation learning and language processing. Olivier has done a wide range of work over the years, from theory to deep RL, so I think he can give a really great overview of the field. So let's welcome Olivier.

Thank you. Yeah, thanks again. I will start, as Bruno did, by thanking the organizers for the invitation. It's always a great pleasure to come to Barcelona and speak about reinforcement learning. I heard a sailors' saying yesterday, and I thought it would be a good introduction to this talk. It was in French, but it basically says that there is no good wind for the sailor who doesn't know where he wants to go. And I think this is something that will drive a bit the level at which my lecture will be. It will not be as technical as Bruno's, because from the beginning, my PhD thesis 20 years ago was about using reinforcement learning to optimize chatbots from human feedback. Back at the time, and for the last 20 years, I looked like the crazy guy who wanted to do something with RL. I know things have changed, but I'm still trying to do something with RL, and I want to have intuitions about why it works and why it wouldn't work. I've been very lucky in my career to meet very early Matthieu Geist, who has been my collaborator for almost all that time, since 2006, I think, and who is now my colleague at Google. Together we did indeed a lot of theory, but also a lot of deep RL things that were meant to be used. So this is the level at which I will speak about RL: what are the concrete algorithms, how we do things in practice, and some intuitions from the theory about why this should work, without going into the details. You'll have a lecture next week by Matthieu who will tell you everything about regularization in MDPs, for example, and you'll see that all these things are sound.

So I will start with a short reminder. Of course, I don't have to remind you from a long time ago, because it was just this morning, but I thought it would be good to have this very short reminder so that we can set the notations in my framework rather than Bruno's. That's something you will also learn with time: the RL community has many different ways of naming things. States and actions have different ways of being noted, but as long as you understand what is what and why things are named this way in different contexts, it should be all right. Same as with Bruno: if you have questions along the talk, please raise your hand, get a mic and ask.

What you got this morning was about how to solve a problem when you know approximately everything: you know the reward function, you know the transition kernel, you know the state and the action space. You don't discover anything; everything is given from the beginning, and then you try to solve the problem to find the optimal policy. Reinforcement learning is a bit different. First, I want to say that reinforcement learning is a problem. It's not a solution, it's not a method, it's not a way of seeing things. It's an open problem that is not solved yet.
And for this problem we have a bunch of algorithms that try to address it, but it's still a problem. The problem is the following. We have an agent; the agent here is the mouse. This agent interacts with an environment. Of course, there is a bunch of offline RL algorithms, but the original RL problem is purely about interaction: you don't get any prior knowledge about anything, you just interact with the world. And the way you interact with the world is by making actions in that environment. For the mouse here, the environment is the maze, and it's going to move into the maze, so it has actions like move forward, move to the left, move to the right. These are actions it has to take given a representation of the environment that the agent builds from its interactions. It doesn't have a proper representation of the environment before it interacts with it; it has an internal state that is supposed to be a representation of the environment, built along the interaction.

To build this internal state, the agent receives observations: every time it does an action in the environment, it receives observations. In the maze, it could be very accurate information like your x, y position, but usually you don't have this x, y position. You have observations of the walls you've seen so far, and you try to remember the path you followed, so that you build a representation of your state from the partial view you have of the environment. And each time you do an action in that environment, the environment tells you whether it was a good action or a bad action, according to some objective you try to follow. For example, here the mouse tries to find some cheese in the maze. It will get zero reward all along the interaction until it gets a piece of cheese, and then it gets plus one, for example. So the reward is a scalar, and it just gives you information about how good your action was locally in the environment. It's not about whether you have reached your goal or not; it's whether you have done something good in the direction of that goal, locally, given the state you are in or the perception you have of that state.

And then the agent has internal parameters, which we call hyperparameters. It can be the gamma factor that Bruno told you about this morning, but it can also be a learning rate, for example: maybe you want to learn fast, and then you use a very high learning rate, or you want to be careful, not forget what you've learned before, and learn more slowly, and then you use a small learning rate. These hyperparameters matter: probably today, if you're skilled in RL, your skill lies most of the time in the knowledge of the hyperparameters you need to use for a given task.
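Since this interaction loop is behind everything that follows, here is a minimal sketch of it as code. The `env.reset()` / `env.step(action)` interface and the `agent` object are my own illustrative assumptions, not something from the lecture slides:

```python
# Minimal sketch of the agent-environment loop described above.
# `env` and `agent` are hypothetical objects: env.reset() returns an initial
# observation, env.step(action) returns (observation, reward, done).

def run_episode(env, agent, max_steps=1000):
    observation = env.reset()
    total_return = 0.0
    for t in range(max_steps):
        action = agent.act(observation)               # pick an action from the internal state
        observation, reward, done = env.step(action)  # environment answers with observation + scalar reward
        agent.update(observation, reward)             # agent refines its internal representation
        total_return += reward
        if done:
            break
    return total_return
```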
Now, this slide about MDPs is just to say that I will not call the state X; I will call it S. That's my choice, but you can also see here that there are different schools, let's say, in reinforcement learning: people coming from maths, from control, from computer science, from neuroscience. The S probably comes from neuroscience. So it's the state S and the action A; in control, you would see X and U, not X and A, and U would be a continuous control command. So here I will use S and A, S for states, A for actions, and T for time. You may also see s used for time in some books, standing for steps; here it's going to be T. The reward is going to be a small r, and the transitions are noted T, for the transition kernel. I try to make it simple, but of course, when you come right after a talk that used different notations, and I'm the one making all the other notations, I'm sorry about that; I just wanted to fix this.

So now, with these notations, I will define again the value function, and I will define a new value function, which is the Q function, or state-action value function. The value function, as Bruno mentioned before, is the expected gamma-discounted return that you get when you apply a policy pi starting from a state S. You have a fixed policy that will not change, and you want to evaluate it. The way we evaluate a policy in reinforcement learning is to compute the expected sum of discounted rewards that you get along a trajectory. In this talk, I will suppose the horizon to be infinite: you always interact with the environment. This is also something that I should have mentioned on the previous slide: the original RL problem definition doesn't suppose that there is a final step. You just interact forever with the environment until you die, but you don't know when you die, so let's say there is an infinite horizon. So here I will consider the infinite-horizon discounted problem, which is the one most people use in practice; most of the time, even if the task is episodic, people will use algorithms derived from these notations.

Now, for the rest of this talk, I need to define a new function, the Q function, or state-action value function. The difference with the value function is that you have an additional degree of freedom on the first action: you can choose the first action however you want, and then you follow your policy pi. What is the goal of this Q function? It's meant, again, to make the things we've learned so far practical. Usually, you don't know what the model is; you don't know what the transition kernel is. What you can observe is just the action you did, the state in which you landed, and the reward. So if you manage, for some reason, to get the value of the next state following your policy, the Q function can tell you whether an action is better than your current policy: if an action has a Q value higher than the value of the policy in that state, then you should take this action rather than the action set by the policy. You'd better change your policy and switch to that action. That's a way to handle policy improvement without having to run a full dynamic programming algorithm.

So with these notations, I also wanted to recall the Bellman equations that I will use in the rest of this talk.
The Bellman equation for the Q function relates the Q function to the value function. What does it say? It says that the value function is the average, over the probabilities of taking each action in a given state — so over the policy — of the Q values of these actions. For the Q function, you don't do that average anymore, because you have selected one action: you remove the dependency on the policy for the first step, since you have made a single choice of the action A, and you take the immediate reward plus the discounted value of the next state. These two things give you ways of evaluating a policy for each state or each state-action pair. They actually define a pair of linear systems of equations that you can solve if you want: if you use a tabular representation, so if you can store the Q function and the value function exactly, then you can solve these two linear systems and compute the Q values and the V values exactly. So that's for evaluation.

If you want to do control, so if you want to find the optimal policy, you have the equivalent of these two equations for V star and Q star. V star is the value of the optimal policy and Q star is the Q function of the optimal policy. If you want the value of the optimal policy, you just replace the average under the policy by the max: you take the action that maximizes this quantity. That's the value of the optimal policy, or the optimal value, because there is only one optimal value. This gives you a recursive equation again: V star of S is defined through V star of S prime. And you have the equivalent for the Q function. So now you can compute the optimal value V star and the optimal Q function Q star, for the optimal policy, or one of the optimal policies.

The good thing, as I said before, is that once you have a Q function, you can derive a policy directly from it. If you had only a value function, you could not easily derive a policy out of V without knowing the model: you don't know where you will transition. With the Q function, you just say that the argmax of the Q function gives you the optimal action for the state you are in. That's the whole purpose of the Q function: it gives you a way to compute the optimal action, the optimal policy, directly from the function.

And now I will just end with the two algorithms that Bruno presented before. Value iteration tells you that you can compute the optimal value just by noticing that this operator is a contraction. I guess everyone knows what a contraction is — otherwise this morning would have been a bit confusing. Anybody who doesn't know? OK, everybody knows, cool. Using the fact that it's a contraction, for the value iteration algorithm you just compute a series of iterates of the value: you start from a random initialization, you apply this operator n times to your initial value, and when you are close enough to stability, you stop. Then you take the optimal policy out of the value that you have computed. So that's value iteration: it computes the value of the optimal policy, and you extract the policy from that value function. Policy iteration has a first evaluation phase, where you also use the fact that the evaluation operator is a contraction: you apply it n times to evaluate your initial policy pi 0. Then you take the argmax to get the next policy, and you do that until the policy doesn't change anymore. When it doesn't change anymore, as was proved this morning, it means that you have reached optimality.
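For reference, the objects I have just described can be written compactly as follows (a summary sketch in my notations, with T the transition kernel and gamma the discount factor):

```latex
% Value and state-action value of a policy \pi (infinite-horizon, discounted)
V^\pi(s)   = \mathbb{E}\left[\textstyle\sum_{t\ge 0}\gamma^t r(s_t,a_t)\,\middle|\, s_0=s,\ a_t\sim\pi(\cdot|s_t)\right]
Q^\pi(s,a) = \mathbb{E}\left[\textstyle\sum_{t\ge 0}\gamma^t r(s_t,a_t)\,\middle|\, s_0=s,\ a_0=a,\ a_{t>0}\sim\pi(\cdot|s_t)\right]

% Bellman evaluation equations (linear systems in V^\pi and Q^\pi)
V^\pi(s)   = \textstyle\sum_a \pi(a|s)\, Q^\pi(s,a)
Q^\pi(s,a) = r(s,a) + \gamma \textstyle\sum_{s'} T(s'|s,a)\, V^\pi(s')

% Bellman optimality equations (replace the average under \pi by a max)
V^*(s)   = \max_a \left[ r(s,a) + \gamma \textstyle\sum_{s'} T(s'|s,a)\, V^*(s') \right]
Q^*(s,a) = r(s,a) + \gamma \textstyle\sum_{s'} T(s'|s,a)\, \max_{a'} Q^*(s',a')
```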
So that was the recap of this morning with my notations, and now I will start with the RL algorithms. Any questions so far? OK.

So what is our problem? Our problem is that with policy iteration and value iteration, as you can see, you still need the knowledge of the transition kernel and the reward function to solve the problem. You also need this kernel in policy iteration to extract the policy from the value. And these, usually, you don't know. When you interact with an environment, you don't know whether, given an action, you will end up in one state or in another, because most environments are stochastic. So if you don't know the transition kernel, you won't be able to run policy iteration or value iteration, or to extract any policy from the values you got.

One way to solve this, the most naive one, is to use the interactions with the environment and, instead of computing any value or return, estimate the transition probabilities that you observe during the interaction. Every time you move from one state to another given an action, you increase a counter, and you divide by the number of times you made that action in that state, for example. That gives you the probability of going from one state to the other after making an action. And now you have a model of the world, and you can run policy iteration or value iteration, because you have approximate values for the transition kernel, and you could also observe the reward function along the trajectories. This is called adaptive dynamic programming. It's a way of doing model-based reinforcement learning: model-based RL is a method where you either learn only a model — the transition kernel and the reward — or you learn both value functions and a model of the environment. So these are two classes of RL algorithms, model-based and model-free. This would be a model-based algorithm, where you learn the parameters of a model and you try to plan in this model using dynamic programming, for example.

But if you don't want to rely on a model, you can just rely on samples. That's the other extreme: you rely only on samples to approximate what you want. What we've seen so far is that if you have a value function, you can extract a policy out of it. So what if we learn the value functions directly from interaction? This is what the so-called Monte Carlo methods in RL do. Monte Carlo is always about learning from samples, and here the samples are trajectories. You place yourself randomly in some state, you start from there, and you play your policy, n times. From these n runs, you get a total return for each trajectory generated with this policy, and you average this return over all the trajectories. That gives you an estimate of the value function for the state you started from. But of course, there are many caveats. The first one is that you need to be able to start from any random state, which is not always the case in robotics, for example.
You cannot just put your robot in any random state and expect to run a policy from there; there are plenty of states that you cannot initialize your system from. But also, if you look at the policy iteration algorithm, you will see that even if you manage to compute a value, that's not enough: extracting the policy from the value requires the model. So running these Monte Carlo rollouts — that's another name for trajectories in RL, you just roll out your policy in the environment — is not enough. You cannot just start from any state, roll out your policy n times and average; you get a value, but not something you can extract a policy from.

So again, this is why we defined the Q function. If you manage to estimate the Q function with any method, then you can extract a policy from it just by being greedy. When I say greedy, it means acting by taking the argmax of a function, here the argmax of the Q function. Being greedy in reinforcement learning generally means taking the argmax of the Q function, where the Q function has been estimated in some way. And here, in Monte Carlo methods, the way you estimate Q functions is by running rollouts: starting from a state, taking each action in that state, and then running my policy n times from the state I landed in. So in state s, I do action a, then I roll out my policy n times, I average over these trajectories, and that gives me the Q function for s and a. That means that for S times A state-action pairs — the complexity is the size of the state space times the size of the action space — I need to run n rollouts each time, which is very cumbersome. But it is a good estimate: the Monte Carlo estimate is an unbiased estimate of V or Q. You just average over trajectories, and that gives you a pretty good estimate of what you are looking for, provided the number n of trajectories you generate is large enough.

There are refinements of these methods: each state along your trajectories can itself be considered as a starting state for the rest of the trajectory. That introduces some biases, but you can correct for them, so you can be more efficient than starting all the time from random states. That's kind of old-fashioned RL, but it actually has a revival — actually, all the things that work well today are things that were designed in the 90s or so. We spent all this time trying to make things efficient, and now we just simulate.
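As a concrete sketch, here is what that Monte Carlo estimate of Q looks like in code, assuming a simulator that can be reset to an arbitrary state — the `env.set_state` / `env.step` interface and the finite horizon cutoff are my own assumptions for illustration:

```python
# Monte Carlo estimate of Q^pi(s, a): fix the first action, then roll out the
# policy n times and average the discounted returns. Interface is illustrative.

def mc_q_estimate(env, policy, s, a, n_rollouts=100, gamma=0.99, horizon=200):
    returns = []
    for _ in range(n_rollouts):
        env.set_state(s)                        # place the simulator in state s
        state, reward, done = env.step(a)       # take the fixed first action a
        g, discount = reward, gamma
        for _ in range(horizon):                # then follow the policy pi
            if done:
                break
            state, reward, done = env.step(policy(state))
            g += discount * reward
            discount *= gamma
        returns.append(g)
    return sum(returns) / len(returns)          # unbiased average over the n returns
```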
So, what did we learn so far? This morning we learned about dynamic programming. What is the strength of dynamic programming? It exploits the temporal structure of the MDP: it links the value at a given state to the value at the next state. Which is kind of natural — the definition of the value is the cumulative return you get from that state. If I'm in a given state S and I receive a reward R, the value of my state is just that immediate reward accumulated with the rest of the rewards I will get along the trajectory. So it's very intuitive that there is this temporal structure in the MDP that you should exploit. You shouldn't wait for the end of a trajectory to know how much return to expect along it if you already know how much you can expect from the next state. That's the temporal structure you have in an MDP and that you should be exploiting. When you do Monte Carlo, you don't do this: you don't exploit the fact that the value at the next state is linked to the value at the current state. But the issue with dynamic programming is that, as I said many times in this lecture, you need to know everything: the transition kernel, the reward, and even all the states and actions beforehand, because you need to build these big tables in which you will store the values. That's a problem you don't have with Monte Carlo methods. With Monte Carlo, you discover things over time — even the state space can be discovered over time. You can create a table that extends: when you discover a new state, you just add a component to your vector. You can learn on the fly the size of the state space. Well, you'd better know the actions you can do, but even the state space can be discovered over time.

So these are the two ways of doing RL so far: either dynamic programming, where I may learn a model, or Monte Carlo. And it turns out that there is a good way to mix these two things, which is the temporal difference principle. The temporal difference principle is also very intuitive to explain. Let's say you are in an ideal case where nothing is stochastic: my policy is deterministic, the environment is deterministic, the reward function is deterministic, everything is deterministic. So when I run a policy from a given state, I always get the same trajectory, I always get the same rewards at every step, and I can compute the return exactly from one single trajectory. When I say return, the return is the accumulated sum of rewards, or the discounted one; and the value is the average of the returns, the expectation of the returns. So value and return in the fully deterministic case are exactly the same. And as you can see here, you can just replace the rest of the trajectory by the value at the next state: I have V equal to r plus gamma V at the next state. It holds for every policy — I didn't mention any pi here — so for every policy, including the optimal policy, this holds.

So now let's think about it as a supervised learning problem. If this is supposed to be equal to this, then the difference between the two should be 0. Very simple. Now, in a realistic setting, you will have some stochasticity, just because you are learning, and maybe your system is non-deterministic. Maybe your policy cannot be fully deterministic, because sometimes you want to do an action and you do another — that's the case in robotics: if you are not accurate enough, you do something slightly different and you end up in a state that you didn't expect. So this difference will not be 0, and that's what we call the temporal difference: the difference between the value at the current state and the reward plus the discounted value at the next state, according to our current estimate of the value.
If it's not 0, it's called the temporal difference. And you can now see where I'm going: I'm trying to see this problem as a supervised learning problem. That's where we stood in the 70s; that's how the training of neural networks was seen. I have an estimate, I have a target for this estimate, and I want to minimize the difference between the target and the current estimate. This is where the temporal difference method for RL came from: from the way neural networks were trained back at the time, with the so-called Widrow-Hoff update. The idea of this update was that every weight in the network should be updated from its current value by adding some fraction of the error in the prediction. You were doing predictions with your network, and this prediction had an error: you could compute the difference between the prediction and what the value should have been for that input, and you add a little bit of that error to the current parameters of your network. And that is supposed to converge to the parameters that provide the right prediction for every input.

We can do exactly the same thing for reinforcement learning. If you want to compute the value function, you can create a series of iterates of the value function built on the same principle: you start from a random value function, and you add a little bit of the temporal difference along the trajectories you generate. For each transition from one state to the other, you compute a temporal difference with your current estimate of the value function, and you add a fraction of it. With time, you converge to the actual value function of the policy that you are currently following. So that's just an evaluation method, not a control method. This algorithm, TD(0), provides you with a way to compute a series of iterates that converge towards the value function from interactions, and you can update the value at every transition. You don't need to wait for the end of the interaction, which is why it's quite cool. And of course, there is a learning rate: it tells you how much you want to update your value function given the information you got from the current transition. You can see that the only information you get from the current transition is the reward and the state where you end up. You have taken an action according to your policy, which is provided in advance; you end up in some state, you get some reward, and from this information only — next state and reward — you get an update of your value function. And that converges.

Now, if I want to do control, with value functions alone I cannot do much, except say that one policy is better than another: I have two policies, I can evaluate both in every state, and if one is better than the other in every state, then I should take it. That's the only thing I can do with values; I cannot really extract a policy from them. So I need a trick, again, to be able to extract policies from value functions.
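Before moving to the Q function, here is a sketch of what this TD(0) evaluation loop looks like for the value function — tabular case, with a hypothetical `env.reset()` / `env.step()` interface:

```python
from collections import defaultdict

# Tabular TD(0) policy evaluation: after every transition, move V(s) a little
# towards the target r + gamma * V(s'). Environment interface is illustrative.

def td0_evaluate(env, policy, n_episodes=500, alpha=0.1, gamma=0.99):
    V = defaultdict(float)                      # value table, initialized at 0
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])     # add a fraction of the temporal difference
            s = s_next
    return V
```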
And of course, the way we do that is by applying this evaluation, this TD framework, to the Q function. The Q function follows the same principle. Here I was using the Bellman equation for the value function; there is an equivalent Bellman equation for the Q function, which is exactly the same with V replaced by Q. And this gives the SARSA algorithm. It's about building iterates of the Q function, starting from any random initialization. After each transition, I look at the state I end up in, the action I will take in that next state, and the reward I received; I build the temporal difference, multiply it by the learning rate, add it to the current estimate, and that gives me my next estimate of the Q function. And this converges towards the Q function of the policy you are following. Why is this called SARSA? Because you are using S, A, R, S, and A to update your Q function: you use a tuple (S, A, R, S', A') for each update. That's the reason this algorithm is called SARSA; it's not an acronym for anything else.

So how do you do that in practice? This is the algorithm you would implement. You start with a Q function — zero, for example. There are plenty of good tricks for starting with a nicer Q function; I will explain that later. But let's say you just initialize with a random Q function. Here I will consider that there is a total number of iterations after which I stop learning, but you can actually run SARSA forever and it will keep learning; there is no need for an actual end-of-loop criterion. Then, there is a choice of state. You can draw a random state, or you can always start from the same state and expect your environment to bring you to different parts of the state space. If you cannot reinitialize your state, you just let the environment bring you wherever. For each state you end up in, you have to choose an action, and the trick here is that you choose your action as a function of the Q function. For example, you can take the argmax; that would give you a policy that improves with time. But there are plenty of different ways of choosing this function, and I will come back to that later as well. Once you've chosen your action, you perform it in the environment and you observe the next state and the reward, and from that you build the temporal difference. As you see here, you need the next action: you draw it from the same function of the Q function, which gives you the next action a prime, and then you can build your temporal difference. From that temporal difference, you update your Q function with a learning rate that you've chosen — it's a hyperparameter of the algorithm. And you do that forever. As long as your function here is improving — as long as it goes in the direction of improvement on the Q function, for example a max, or a probabilistic max, a softmax, let's say — you will move towards an optimal policy within the shape defined by F of Q. That's also something I will come back to: this function shapes the space of policies that you can actually learn with this algorithm.
But what is important to get from the SARSA algorithm is that it lets you learn after each interaction with the environment: after each transition from one state to the other, you have enough information to update your Q function, and so you can improve over time both your estimates and your policy. You can also see that there is no max anywhere in the update. This can be seen as an online version of policy iteration: I'm evaluating my current policy here, and from that evaluation, I'm trying to improve the policy with this function F. Policy iteration was a two-step algorithm where you had to evaluate the policy and then become greedy to get the next policy; here it's a bit the same. After every step, you improve your evaluation of the policy, and from that improvement, you improve your policy. So it's an online version of policy iteration.

Yes — so the question is whether the learning rate can be a function of the state and action. It can; I actually wrote it as a function of n and of the state and action. For example, imagine that the stochasticity is higher in some parts of the state space than in others. You may want a small learning rate where things are very uncertain, because your estimates are very uncertain due to the high stochasticity, and a higher learning rate in parts that are almost deterministic, where you can learn very fast. So you can be sensitive to that. I guess you'll have classes about how to measure uncertainty in MDPs — that measurement is an issue in itself — but you can have ensembles of experts, for example, and there are plenty of methods to estimate uncertainty in MDPs. That's the very generic case. In practice, you just take alpha equal to 1 over 1 plus t, and it works.

So as I said, this is a very — oh, sorry, yes? It says Q pi t there, and then later it says Q n; are these meant to be different, or is it the same thing inside the F? Oh yeah, sorry, no, that is Q n. That's my bad, it's a typo: it's the current estimate Q n, evolving over time. Thank you very much.

Other questions? Yes — it has to lead you to take an action that has a higher value than the current policy. It doesn't have to be the max, but it has to be a bit better than the current policy, on average.

Can we estimate how many steps we have to run this for, at least, or do we just guess a large number and see what happens? It depends on the environment: on its complexity, on the stochasticity, on many, many things. There are probably upper bounds — better ask Bruno about this — but it's a lot. It's super inefficient, I can tell you. OK, thanks.

In this, do you actually have to perform the second action, a prime, or can you just use it to estimate the update? You need to have taken action a to end up in s prime, but you don't need to have actually performed action a prime to do the update. Anything else? No?
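Putting the whole loop together, here is a minimal tabular SARSA sketch. The epsilon-greedy choice for the function F, and the `env.reset()` / `env.step()` interface, are my own assumptions for illustration:

```python
import random
from collections import defaultdict

# Tabular SARSA with an epsilon-greedy choice for the function F of the Q values.
# The environment interface and the action list are illustrative assumptions.

def sarsa(env, actions, n_episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)                         # Q[(s, a)], default 0

    def epsilon_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)          # explore with probability epsilon
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s, done = env.reset(), False
        a = epsilon_greedy(s)
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(s_next)        # draw the next action from the same function of Q
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # temporal-difference update
            s, a = s_next, a_next
    return Q
```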
OK, so just to recap: this is online policy iteration, a very rough version of policy iteration, because you update only one state-action pair at a time. One iteration of policy iteration would update the values of the whole state space, while here a single state-action pair sees its value change, and still you change the policy from that estimate. So it's very asynchronous — it's a bit as if you didn't wait for convergence of the evaluation of the policy before changing the policy. It's a kind of asynchronous policy iteration, and it's also proven to converge.

So now, what if we use value iteration instead, and try to build an online version of value iteration? What would it look like? Well, it's just about changing the target. Here, I used the evaluation Bellman equation to build the target: I said that this should be equal to this, I took the difference between the two, and that was my temporal difference. If I use the optimality Bellman equation instead, I build a new target with a max: instead of the value of the next action drawn from the policy, I take the value of the maximizing action in the next state — the action that provides the maximum return in the next state according to my current estimate. This is the Q-learning algorithm, from 1992 or so I think — a very old algorithm. You'll learn about DQN this afternoon or tomorrow, and DQN comes a bit from there. There are plenty of interpretations of DQN — it can be seen as value iteration or as policy iteration; it's rather policy iteration from the latest discussions — but there is still a max.

So the idea here is that you build your target using the optimality Bellman equation, and with that target you estimate a Q function which is not the Q function of the policy you are following: you estimate directly the Q function of the optimal policy. This target leads you toward the optimal policy; it doesn't estimate what you are doing, but what you would do if you were taking the max over the actions of the Q function. So it directly gives you the optimal Q function, from which you can extract the optimal policy. This is Q-learning, and it is called an off-policy algorithm, because it is not evaluating the policy it is following; it is evaluating a different policy. In this case it's the optimal policy, but you could imagine a target for any other policy and be able to evaluate that other policy while following the current one. So there is the policy you are following, which is called the behavior policy, and there is another policy that you are trying to evaluate — here, the optimal policy.

How does that work? Again, you start with a random initialization of the Q function, but now you don't need to sample the next action: you just take the max over the Q function to compute your temporal difference, and the temporal difference, again, is used to update the estimate. This is quite important and fundamental, so if something is not clear, I encourage you to ask questions.

Are you framing Q as a table lookup, or is it a parametric function? No, for the moment, you can store Q exactly.
For the moment — I mean, you will have classes on value and policy approximation later; here, I consider that you can store tables and vectors. So if it's a table lookup, then I assume it's guaranteed to converge? Yes. Well, for Q-learning it's been a question for a while, but yes, under some assumptions it is.

Do we have any constraints on the behavior policy for it to converge? Yes: the behavior policy has to cover the state-action space if you want it to converge. If you have a random initialization of your Q function and you never visit some parts of the state-action space, you never update them: if the behavior policy doesn't lead you to those states, you cannot update them. That's an issue. OK, thank you.

So, I'm wondering why there's no multiplicative factor on the reward term, the R n term, because it's off-policy and the reward might have been different. This one, you mean? Well, because it's the definition of the Q function: the Q function is what you get now, plus the rest, and it's the rest that you evaluate as if you were following the optimal policy. The optimal policy or the behavior policy? Here, the target is the optimal one. It tells you what you should do now in order to get the maximum of the current reward plus what you get if you follow the optimal policy afterwards. Once you consider that your algorithm has converged, you have a value function and you try to behave optimally according to it: it tells you which action to take to get the maximum of the current reward plus what follows under the optimal policy. But the R n we observed came from the trajectory produced by the behavior policy. Yes, but once you have converged, you take the max over the current reward plus the rest, and that is the optimal policy. During learning, you are not applying the optimal policy and the rewards are not good; but once you have converged, you take the max, and that works.

Can the behavior policy change over time? Like, can we plug in the policy derived from the Q function we are constructing and use it as the behavior policy? Would it still work? Well, you'd better stay exploratory. If you are immediately greedy with respect to that Q function, you will get stuck in high values according to your current estimate, which might not be the exact estimate. That's actually the catastrophic thing you see in many of the deep RL papers: they show you a nice plateau in the curves, but if you continue training, it just falls down. And the reason is that bad estimates, or optimistic estimates, can lead you into bad parts of the environment. If you are greedy too early, that's what happens: you're too confident in positive values and you end up always looking at the same thing. OK, thank you. Another question?
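Before comparing the two, here is the Q-learning loop as code; the only change from the SARSA sketch above is the max over next actions in the target. The epsilon-greedy behavior policy and the environment interface are, again, illustrative assumptions:

```python
import random
from collections import defaultdict

# Tabular Q-learning: same loop as SARSA, but the target uses the max over next
# actions, so we evaluate the greedy (optimal) policy while following an
# exploratory behavior policy. Interface names are illustrative.

def q_learning(env, actions, n_episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < epsilon:                       # behavior policy: epsilon-greedy
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            best_next = max(Q[(s_next, act)] for act in actions)  # max in the target: off-policy
            target = r + (0.0 if done else gamma * best_next)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```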
So let's compare both now. Most of the pictures here are taken from the original Sutton and Barto book — so thank you, Rich and Andrew. So, the difference between SARSA and Q-learning. You might ask me: why don't we use Q-learning directly? It seems stupid not to, since you can learn the optimal policy directly. Well, it depends on the risk you want to take during learning.

This is an example of an environment where you need to go from the start state S to the goal state G. It's a grid world, three by ten cells in this case, and here there is a cliff. If you fall into the cliff, you go back to the start and you get a reward of minus 100, which is not good. The implicit goal is to go from S to G as fast as possible: for each step you take in the grid, you get a minus 1 penalty, which is how you force the system to find the shortest path. Now let's look at what happens with SARSA and Q-learning. These are learning curves — it's not what the final policy does, it's what happens while you are applying SARSA or Q-learning. I have to say that the function F we are using here for SARSA is epsilon-greedy: at every state, I have a probability epsilon of being non-greedy, so of taking a random action, and 1 minus epsilon of being greedy; the higher epsilon, the more random I am. If you are epsilon-greedy in this environment, SARSA will learn the longer path, away from the cliff. Why? Because there is an epsilon chance of taking a random action when I'm next to the cliff, and one time out of four that random action leads me into the cliff, which gives very bad rewards. So because I'm learning under a function that is intrinsically random, SARSA learns this policy: it's the optimal epsilon-greedy policy, not the optimal policy. But during learning, SARSA is much safer, so it gets higher returns. If you do Q-learning, you learn the shortest path along the cliff — that is the optimal policy — yet during learning you get much lower returns, because your estimate is not good and sometimes you go into the cliff. What you learn is that it would have been better to walk along the cliff, but your behavior policy is partly random, so you still fall into it and collect lower rewards. What you should take from this is that you should not use Q-learning if you are not willing to simulate. If simulation is safe, go for it; otherwise, it's a bit dangerous to use this harsh type of learning. Anyway, when you have a max in an equation, it's never good — just be careful of that.

I was going to skip the part with the lambda stuff — how much time do I have? We started 15 minutes late, so we can go on until 1, half an hour more or less. OK, then I'll give you just a very rapid overview of it. What you can tell now is that we've been very inefficient in terms of usage of transitions: for every update of the Q function, I only use the current transition, but I made a lot of transitions before that have just been thrown away. The transitions I've made can explain the value of my current state, even if they happened ten steps back. Why? Because of the way we built the Bellman equations: we just said that the return is the current reward plus the value of the next state. But you could also say that the return is the reward now, plus the reward at the next state, plus the value of the state after that.
And you can do that for n steps. So there are plenty of ways of defining, or bootstrapping, the value to build estimates: you can do one-step bootstrapping, two-step bootstrapping, or n-step bootstrapping. It's really about playing with the bias-variance trade-off. The shorter the horizon, the more biased you are, because you bootstrap immediately on your current value estimate. If you take longer paths, you have higher variance, because there is plenty of stochasticity after each decision, so the trajectories can be very different — more variance, but less bias, because you are closer to Monte Carlo estimates. So the question is how to balance these two things: how to bootstrap so as to go faster and update along the trajectories, how to use Monte Carlo-like updates to get better estimates at the risk of higher variance, and how to mix the two. I'm mentioning this, without going through all the details, because if you look at the literature on deep RL, and especially DQN and Rainbow and all these value-based algorithms, you will see that there is a factor n, the number of steps you take before bootstrapping, and it is actually important for stabilizing these algorithms. Again, this is not a new thing: the trade-off between variance and bias has been studied in RL for the last 40 years, but we are rediscovering its impact now that we use it in deep learning and in practical applications.

It turns out that there is an elegant way of doing this trade-off, which is to take a geometric average of all these n-step returns, and it leads to a very simple implementation called eligibility traces. The idea is that you leave a trace on each state, or state-action pair, that you have visited, and this trace decays with time. It is a number that tells you how long ago you visited that state and how much the update from the current transition should affect the value of that state. I am not going to go very deep into this, but remember, when you use DQN and you have to choose your n for the n-step bootstrapping, that there is a more elegant way to deal with it. The idea is that instead of doing just a one-step update of the value function, you make small updates everywhere: states that were visited a long time in the past still benefit from the update of the current transition, but with a lower importance, and this importance is weighted by the eligibility trace, which reflects the time elapsed since you visited that state.
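As a sketch of these two ideas — an n-step bootstrapped target, and eligibility traces as the mechanism behind the geometric mixture — here is what they might look like in code. The value table V, the trajectory storage, and the environment interface are illustrative assumptions, not the exact implementation from the slides:

```python
from collections import defaultdict

# n-step bootstrapped target built from a stored trajectory: sum n discounted
# rewards, then bootstrap with the current value estimate V.
def n_step_target(rewards, states, t, n, V, gamma=0.99):
    g, discount = 0.0, 1.0
    last = min(t + n, len(rewards))
    for k in range(t, last):
        g += discount * rewards[k]
        discount *= gamma
    if last < len(states):
        g += discount * V[states[last]]          # bootstrap on the value of the state reached
    return g

# TD(lambda) evaluation with accumulating eligibility traces: every visited
# state keeps a decaying trace, and each new TD error updates all traced states.
def td_lambda(env, policy, n_episodes=500, alpha=0.1, gamma=0.99, lam=0.9):
    V = defaultdict(float)
    for _ in range(n_episodes):
        traces = defaultdict(float)
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # temporal difference
            traces[s] += 1.0                                          # mark the state just visited
            for state in list(traces):
                V[state] += alpha * delta * traces[state]             # share of the current update
                traces[state] *= gamma * lam                          # traces decay with time
            s = s_next
    return V
```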
Now, one question we should ask ourselves is: what type of function F should we use in SARSA, or what kind of behavior policy should we use in Q-learning? That is linked to a very important problem in reinforcement learning, the trade-off between exploration and exploitation. Exploitation is what you do when you have converged: you are happy with your Q function, you exploit it, you take the argmax of it, and you don't learn anything anymore. If you're sure about what you've learned, you'd better exploit and use your estimates to build a policy. But before you've converged, all these algorithms require full coverage of the state-action space to make sure that you converge, and that means you need to explore. At some point you need to say: I think I should go this way according to my current estimate, but I will go that way instead, just to see what happens, because I'm uncertain about the value in that part of the state space. So there is a trade-off to make between exploitation and exploration, and there are plenty of very smart ways of doing it — I think Alessandro Lazaric will tell you how. Here, in one slide, I'm just telling you what the issue is. While you explore, you are making a bet that you will not destroy your system, for example: if you are using a robot, you may actually break it just because you explored too much. So you need to be somehow confident in your exploration, but you also need to explore enough to make sure that you will learn the best possible policy.

Of course, in SARSA or Q-learning you could be greedy immediately. But as I said before, if you're greedy immediately, you may be overconfident in your positive values, and since they are just estimates, those positive values may lead you into very bad situations. You need to take the values into account, but you shouldn't be too confident and purely greedy. It will also probably not lead you to convergence: unless the environment is so stochastic that it naturally brings you to every state-action pair whatever you do, greedy policies will probably not give you full coverage.

So, some very simple policies. The one I mentioned before is epsilon-greedy: I always take the greedy action, the best action according to the Q function, except with probability epsilon, where I take a random action — possibly including the greedy one, it's purely random. Surprisingly, it's hard to do better, to be honest. There are plenty of very smart ways of doing exploration in RL, but in DQN, for example, epsilon-greedy works well; you just need to wait forever before it converges, but at least at the end you have something. Something a bit smarter is the softmax, or Boltzmann, or Gibbs policy. It requires the selection of a new hyperparameter. With epsilon-greedy, the hyperparameter is epsilon, and usually epsilon decreases with time: you want to explore less and less, the more confident you are in your estimates, and usually there is a fixed schedule that just decreases epsilon over time. Here, if you want to be a bit smarter, you build a stochastic policy that takes the Q values into account. It's a kind of peaky distribution that gets flatter the higher the temperature is. The idea is that you will most often take the best action given your current estimate of the Q function, but you have a distribution over actions related to the Q values, so you will very rarely take actions that have very low Q values.
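Epsilon-greedy already appeared in the SARSA sketch above; here is what the softmax / Boltzmann choice might look like over tabular Q values (the temperature parameter and the Q dictionary layout are illustrative assumptions):

```python
import math
import random

# Boltzmann / Gibbs exploration policy over tabular Q values: action
# probabilities proportional to exp(Q / temperature). A high temperature gives
# an almost uniform policy, a low temperature an almost greedy one.

def softmax_action(Q, s, actions, temperature=1.0):
    prefs = [Q[(s, a)] / temperature for a in actions]
    m = max(prefs)                                   # subtract the max for numerical stability
    weights = [math.exp(p - m) for p in prefs]
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices(actions, weights=probs, k=1)[0]
```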
These are exploration strategies that work at the level of the policy, but you can also work at other levels, for example at the level of the initialization of your lookup table. Imagine you know the state-action space and you have a lookup table of Q values; for each algorithm, you need an initialization. Instead of a random initialization, you put the maximum Q value that you can possibly get: if you know the bound of the reward, that's R max over 1 minus gamma. So you initialize your Q values everywhere with that value. You don't do a random initialization; you do an optimistic initialization. What does that do? I told you that in SARSA, for example, the function F needs to take an action that is at least a bit better than the current one. If you set all the Q values to the maximum, then every time you select an action, if you have never visited that state-action pair, you will take it, because it has a higher value than all the others. So it forces you to go to all the states and actions that you have never seen before: optimistic initialization, coupled with epsilon-greedy or a softmax or whatever, will always select the higher values, which are exactly those you have never visited. Then you get a reward, you figure out it's less good than you expected, the value decreases because of the update, and you will not visit that pair again unless you have visited everything else in the meantime. And of course, there are plenty of methods coming from information theory and from the learning theory of neural networks to measure uncertainty about values: ensembles of experts, or computing beforehand the amount of information provided by a transition — you do a forward pass before making the transition and see how much the quantity would change. There are plenty of ways of building bonuses on the reward to make your exploration more efficient.

OK. So, to conclude on this — and if we want to finish on time, I can either conclude here or speak a bit about policy gradient in ten minutes, depending on what we prefer. There is still some time if you want. I can conclude after two slides, let's say, and take questions, or I can go to policy gradient. Well, let's do that. The conclusion of all these slides and algorithms is that RL provides a very good way to do control without a model of the physics. You just have to interact, observe situations and rewards, and you can estimate and improve over time. And it's also learning online — when I say online learning, I'm not talking about bandits here, I'm just saying that you can learn while you are plugged into the environment. The problem is that you need a lookup table. If you want to play Go, for example, there are something like 2 to the power of 256 potential situations that you can see, so there is absolutely no way to store this in a computer's memory. So: large state spaces, large action spaces. For the moment, I also only told you about action spaces that you can enumerate, so that you can take the argmax; that is also an issue. If you want to do continuous control, you cannot do either SARSA or Q-learning. Also, sample efficiency is a problem: RL is known to be extremely inefficient in terms of the number of interactions it needs.
And it's much less sample efficient than supervised learning, because in supervised learning, your loss usually tells you that one thing is good and all the rest is bad, while RL only tells you that the particular thing you did may be good or bad. It doesn't tell you what you should have done; there is no label for that. So there is much less signal in reinforcement learning than in supervised learning. And what you learn is only for the situations you have seen; you cannot generalize to all the other situations, which is bad. So that's going to be the topic of the rest of your classes: how to generalize, how to be more sample efficient, and how to be careful, because here you are in a control loop. If you are generalizing, but in a bad way — like with a neural net, if you just tell it that what you did here is good, it can actually increase the values very far away, because the parameters are shared; it gives you a kind of surface over the solution space, and if that surface moves, even though you are only moving one point, it can change everywhere. So it's very hard to generalize in control: if you rely on your estimates to derive a control policy and these estimates are overestimated — you're over-optimistic — then they lead you into parts of the environment that are not suitable for learning, and that may end up in catastrophic behaviors.

There are plenty of ways of making sure that you generalize well, and the first things were studied in the linear setting. You can actually do everything I said today with function approximation, where the Q function is not a lookup table anymore but is approximated, for example, by a linear combination of basis functions that you define beforehand, which are functions of the state and action. And you can generalize SARSA to the linear function approximation case: you just replace Q by theta transpose phi in the computation of the temporal difference, and the update gives you the new parameters. I don't think I have time to discuss policy gradient, but I know you have a class about that the day after tomorrow. Let's see how many questions there are about the value-based methods.

Well, actually, if the basis functions are Diracs, that brings you back to the lookup table: if each function is just a Dirac on a state-action pair, you get the grid back. So this also covers the discrete case; the features don't have to be continuous, it's just a linear combination of functions.
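To make the linear case concrete, here is a sketch of a single SARSA update with linear function approximation, where Q(s, a) is approximated by theta transpose phi(s, a). The feature map `phi` and the surrounding interface are illustrative assumptions:

```python
import numpy as np

# One SARSA update with linear function approximation: Q(s, a) ~ theta . phi(s, a).
# The TD error moves the parameter vector theta in the direction of phi(s, a).

def linear_sarsa_update(theta, phi, s, a, r, s_next, a_next, done,
                        alpha=0.01, gamma=0.99):
    q_sa = theta @ phi(s, a)
    q_next = 0.0 if done else theta @ phi(s_next, a_next)
    td_error = r + gamma * q_next - q_sa
    return theta + alpha * td_error * phi(s, a)      # semi-gradient step on the parameters
```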
Well, actually, if the basis function is a Dirac on the state-action pair, that brings you back to the lookup table: it gives you a grid. So it generalizes quite naturally, even to the discrete case; this does not have to be continuous, it is just a linear combination of things.

In your conclusion slide, you said that large state spaces are a bad thing. But in the previous lecture, the lecturer said that we can build the lookup table online, adding a new entry to the table as we go. So it is not really the fact that the state space is large that is the problem. In a game, for example, I guess there are some states that you have a very low probability of visiting; do you care about those states?

Yeah, that's a very good point. Most of the time you don't need to see all the states. But if you want to ensure convergence, you should, actually. The more states you see, the better you know the dynamics of the environment. Intrinsically, all these model-free settings still rely on these updates being expectations, estimated empirically. So if you want those empirical expectations to be correct, you need to see as many transitions as possible; ideally you try to see everything, at least in these tabular cases. Of course, in the case of AlphaGo Zero, it learns from self-play, so it generated more trajectories than the whole of humanity ever played. Thank you.

First of all, thank you for the lecture so far. You already mentioned that hyperparameters are a challenge of their own, and an expertise of their own, and you gave a remark about exploration and epsilon-greedy. I'm wondering if you have a remark on the learning rate.

The learning rate, if you do this lookup-table stuff, is going to be something like 1 / (1 + alpha t), with alpha being another hyperparameter, and it should decrease over time. That works reasonably well. In deep learning, learning rates are another science: most of the tricks used there build on momentum, so if your gradient keeps going in the right direction, you accelerate in that direction for several steps. That does not work well in reinforcement learning, because you learn the representation and the policy together, and these large, purely sample-driven updates don't behave well. What the most principled way of setting learning rates in deep RL is, is still ongoing research, really. So it's not easy. Thank you. You're welcome.

Thanks a lot for your talk, Olivier. I had one quick question: at some point you mentioned that DQN is closer to policy iteration than value iteration. Could you comment on that?

Well, you know, there is a max, so the update looks much more like value iteration. But when you think about it, the max is there to extract the policy, so it is kind of policy iteration as well. Maybe I shouldn't have made this remark, because it's something I've been arguing about a lot, but at the end of the day it's true that it does rather a policy iteration where the improvement step is a max. Thank you.

Hello, thank you a lot for your talk. I was wondering, when you want to try a reinforcement learning approach, do you usually start with these methods, which seem quite simple, or are they considered inefficient compared to more recent methods? Do you apply them directly, or do you start with other methods that we will see in the next lectures?

Well, it really depends on the size of your environment. Adding a neural net to a simple problem makes things more unstable. For example, on the famous cart-pole task, DQN does much worse than a logistic regression with two neurons. It's really an Occam's razor situation: use the simpler solution if your problem is simple. Of course, if you're trying to solve autonomous car driving, then you'll need much more powerful things. But if you can simulate, and your problem is not that big, I'd do this. OK. Thank you.
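Regarding the learning-rate remark above, here is one way the 1 / (1 + alpha t) schedule is often implemented for tabular updates, with t counted per state-action pair. The exact form and constants are an assumption for illustration, not a prescription from the lecture.

```python
import numpy as np

# Illustrative tabular setting (sizes are assumptions).
n_states, n_actions = 50, 4
visit_counts = np.zeros((n_states, n_actions))

def step_size(state, action, alpha0=1.0):
    # Per-pair schedule: the more often (s, a) has been updated, the smaller the step,
    # decaying like 1 / (1 + alpha0 * t) where t is the visit count of that pair.
    visit_counts[state, action] += 1
    t = visit_counts[state, action]
    return 1.0 / (1.0 + alpha0 * t)
```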
Earlier, I think you said that a purely online version of Q-learning, which follows its current greedy estimate instead of a separate behavior policy, could be a very bad idea because it can lock itself into a greedy path. So when you actually run the learning, do you use historical data, or some sort of hybrid approach? Is that possible?

Yes, of course. There is a lot of work on offline RL and on combining offline and online learning; I think you'll also get lectures about that. RL can be cast as a supervised learning problem, especially for the Q-function approximation, and for that you can use historical data. Also, if you do Q-learning and you can simulate, you'd better use a completely random policy with random starts, if you can re-initialize: that gives you good coverage, and then you can learn either with the Q-learning update or with a batch algorithm.

This is a bit of an off-topic question, but has there been a theoretical understanding of the convergence of Q-learning with respect to some sort of distance between the behavior policy and the optimal policy? It seems like it should converge.

It's all about coverage. I don't think there are results that measure the closeness of the behavior policy to the optimal policy, although I'm not sure there aren't any. But usually it's about coverage, about how well the behavior policy covers the state-action space.

OK, I think we are all hungry, so maybe we can keep chatting over lunch. But don't leave yet, I have to tell you a couple of things first. Well, first of all, let's thank Olivier for the wonderful lecture.