So I hope you had a great weekend. We're going to start the second week of the summer school, and I'm going to introduce the first speaker of today. So today we have Mathieu Geist. Mathieu is a full professor at the University of Lorraine, currently on leave at Google DeepMind, doing reinforcement learning in the RL team in Paris. And Mathieu is one of those researchers who managed to somehow bridge the gap between RL theory and deep RL. Some of you may know him from his nice work on regularized MDPs, in particular one great paper that he wrote with Olivier and Bruno, whom we had with us last week. And some of you may also know him from his papers on actor-critic methods in deep RL. And maybe you've heard of the Munchausen trick, which I'm sure he is going to mention today. So it's always exciting to have someone talk about this topic, and I'm sure it's going to be nice. So please, everyone, give a warm welcome to Mathieu. Thank you. So I'll talk about regularization in reinforcement learning. If anything is unclear, feel free to interrupt me. I'd rather not cover everything but have you understand what we do cover. So really, feel free to stop me if something is unclear, if I'm too fast, or whatever. And we'll do a break at half the lesson so that you can rest a bit. So today we talk about regularization, and specifically regularization in reinforcement learning. The regularization I'll talk about is a bit different from what we usually call regularization; this term can encompass a lot of things. Typically, in supervised learning, we call regularization something that shrinks, say, the hypothesis space, that is, the size of the set of functions you search in. For example, putting some L2 regularization on your weights is an indirect way to shrink the space of functions you're looking at. But here it's really different. Here it's really about the behavior: in reinforcement learning, you learn a behavior, a policy, and it's this policy that we try to regularize. So we'll start with some warm-up, things that you should know by now, but it's to recall some notation and make things crystal clear. Then we'll see some case studies, with two very classic cases: entropy and Kullback-Leibler regularization. And we'll see that basically, once you have done these two cases, you can combine them in any way to create more cases, some being interesting and useful, some not that much. And we'll see what the issue is and what the remedy is, the Munchausen trick that was hinted at just before. And I'll cover some related topics if we have enough time. So as a warm-up, we'll start from the basics. You've seen dynamic programming at the beginning of the summer school, I guess. You've seen DQN, and these few slides are just to make crystal clear that DQN is just approximate value iteration, because this is the scheme that I will modify all along the talk. So it's pretty important to have well in mind what value iteration is. This you should know: it's closed-loop control. You have an agent that observes states. Let me add some notation. The agent has a policy. A policy associates to each state a distribution over actions. When you do classic reinforcement learning, you often care only about deterministic policies; that is, you want to choose a single action for a given state. Here, for a reason that will be clear later, we'll look at stochastic policies: we want to associate a distribution over actions to each state.
We have the value function, which is the expected cumulative reward, and we want to find an optimal policy, maximizing the value function over the long term. For computing the optimal policy, you have a bunch of possible approaches, but one thing which is prevalent in reinforcement learning is dynamic programming, or approximations of dynamic programming, even if it's not the sole possible approach. So this is again basically for the notation. The Q-function: with the value function, you only have access to the current state, and it's hard to compute the best action, in the sense of the greedy action. Q-functions have been introduced to address this. The Q-function is the expected cumulative reward when following some policy pi, when you start from a given state, choose a given action a for the first decision, and follow the policy pi afterward. This Q-function can be simplified to the reward plus gamma times the expected Q-function in the next state, and this is called the bootstrap term. So this is something you should know already. And you can write this Q-function as being the fixed point of the Bellman operator T_pi, which associates to Q the reward plus gamma times the expected Q-value in the next state. I will work only with this guy afterward. Next, value iteration is a specific algorithm that tries to solve an MDP by solving for the optimal policy. This is something you should have seen already, but the way I present it, maybe Bruno presented it like that, is not the most usual way, in the sense that often you see value iteration as iterating the optimal Bellman operator, T star. Here, instead, I will iterate the evaluation operator for the policy being greedy according to the current Q-function. This is purely equivalent, but I want to really see what happens at the greedy step and what happens at the evaluation step. And I focus the whole talk on value iteration because it's simpler and because we don't have a lot of hours. But many things that I will tell you today apply to more general schemes like policy iteration, or even modified policy iteration or other variations. So, value iteration. You can define the greedy policy as the policy that maximizes, state-wise, the Q-function. This gives you a deterministic policy; if several actions have the same value, you just pick one at random. And value iteration is an iterative algorithm, each iteration having two steps. The first step is to compute the greedy policy according to the Q-function; this is called the greedy step. The second step consists in applying the Bellman operator for this policy to the previous Q-function; this is called the evaluation step. Evaluation is maybe not the right word, because you don't compute the full Q-function of the policy, you just apply the Bellman operator once, but I'll talk about evaluation in this case. And the more classic form that you may see more often is that the next Q-value is the reward plus gamma times the expectation over the next state of the maximum over next actions of the previous Q-function. This is the classic way, but these two ways of writing value iteration are purely equivalent, and we'll work with the former form, sketched below.
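To make the two-step scheme concrete, here is a minimal tabular sketch, assuming a known transition kernel P and reward r; the array names and shapes are illustrative, not from the slides:

```python
import numpy as np

def value_iteration(P, r, gamma=0.99, num_iters=1000):
    """Tabular value iteration, written with an explicit greedy step
    and an explicit evaluation step.

    P: transition kernel, shape (S, A, S); r: reward, shape (S, A).
    """
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    for _ in range(num_iters):
        # Greedy step: deterministic policy, one action per state.
        pi = np.argmax(Q, axis=1)
        # Evaluation step: apply the Bellman operator T_pi once.
        V = Q[np.arange(S), pi]      # pi . Q, here a hard max over actions
        Q = r + gamma * P @ V        # (S, A, S) @ (S,) -> (S, A)
    return Q, np.argmax(Q, axis=1)
```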
Then, if you know the dynamics, the transition kernel, you know the reward function, and the state space and action space are small enough, you can compute everything. You apply this scheme for an infinite number of iterations and it converges towards the optimal policy. But often you have a number of issues. The first one is that the model is unknown; that is, you cannot know exactly the transition probabilities for a given state-action couple, but what you can do is interact with the system and sample the next state from the current state-action couple. The second issue is that you have to learn from data, not knowing the model. And the third thing is that the state or action spaces are too large for representing the value function exactly as a table. So you have to use function approximation: linear function approximation, or neural networks, more or less big. And if you look at the value iteration update rule, the left-hand side can only be approximately represented: even if you were able to compute exactly what is on the right-hand side, it will not be possible to satisfy this as an equality. You will have an approximation, because your space of neural networks cannot represent any function, only some functions. The second issue, related to the model being unknown, is that the expectation cannot be computed. When you cannot compute an expectation, you compute an empirical expectation, and what we'll do here is use a single-sample empirical expectation, and it will be enough. And you would want this equation to be true for every possible state-action couple, but it's not even possible to enumerate all possible state-action couples in your MDP; you cannot go through all configurations, and this alone is a super hard problem. So what you do is rely on data, the data you have. And so this is how you can somehow introduce DQN, to cope with these three issues. For the first issue, the Q-function not being representable as a table, you approximate this Q-function with a neural network. So theta are the parameters of the neural network, and you will just try to fit your target with this neural network. You don't know the expectation over next states, but you can use a single-sample empirical expectation over the next state. And you cannot optimize over all possible state-action couples, so you have a dataset of transitions experienced by the agent, and you use this dataset to define an empirical expectation over transitions, and you try to minimize the squared loss between the Q-value approximation and the current target, the bootstrap term. And you have basically DQN; a minimal sketch of this loss follows.
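A rough sketch of the loss just described, assuming torch-style networks; the batch handling is illustrative, and replay buffer, exploration, and target updates are deliberately left out:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """One DQN regression step on a batch of transitions (s, a, r, s', done).

    q_net / target_net: any torch module mapping states to (batch, A) Q-values.
    """
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q_theta(s, a)
    with torch.no_grad():                                  # target network is frozen
        # Single-sample empirical expectation over the next state,
        # hard max over next actions: the bootstrap term.
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)
```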
Then you have a number of additional issues that are areas of research by themselves. The first one is how to fill the replay buffer, the dataset you use for learning. This is a big issue compared to supervised learning. In supervised learning, usually speaking, you have a dataset which is fixed; there is research about how to fill this dataset, how to create good data for the task you may have. But in reinforcement learning, the dataset is something you often have control over, and the way you control how the dataset grows is not well understood. So this is something I will totally skip in this talk. There is also the question of the exploration-exploitation dilemma: how to interact with the system so as to have diverse enough data, but without doing exploration all the time. This is something I will not talk about either. And another question is when to update the target network, because if you look at this equation, this is the target network; I'll write the target with a bar. If you followed the dynamic programming idea, you would learn this Q_theta as much as possible; that is, you would use a huge dataset to learn it. But it's much, much more effective to use a much smaller dataset and to update the target network more frequently. So I guess you had a course, and possibly a practical session, on that, and this is just DQN. I am spending a bit of time on this algorithm just to show that from value iteration, you can go quite readily to DQN through the loss component. And then you have many other aspects of DQN that are super important, but when you change your value iteration scheme, you will change your loss, and you can keep all the other components of DQN as they are; you don't need to change them. So in the end, you work on value iteration, you do some things, we'll see what, and you can just change the loss of DQN and have a new algorithm, just by changing the loss. And then the question is how good DQN is. This is only a partial answer, a very partial answer, but it will be important later. A way to write DQN for analysis is to say that the greedy policy is greedy with respect to the current Q-function, and then you apply the Bellman operator. When you compute the greedy policy, it's a max over actions; I will mostly consider discrete actions, small enough in number that you can compute the maximum. When you compute the maximum, you have no error: even if Q_k is wrong, computing the maximum of Q_k is not wrong. Then you apply the Bellman operator and you make some error, epsilon, which is basically the difference between what you compute when updating your neural network and what you would obtain if you were able to compute everything exactly. So epsilon_k is just a random error that depends on the learning aspects. And the question I will address in this talk is: if I make some error at each iteration, what do I pay in the end, in terms of the error I will have with respect to the optimal Q-function? One bound is the following one. It's really general, because it doesn't make any assumption about how epsilon_k is produced; you just have an epsilon_k, whatever it is. You can have a linear parametrization, you can have a neural network, whatever. And once you have this epsilon_k, the distance to optimality is the left-hand side: the difference between Q_star and Q_pi_K. Pi_K is the policy you compute with your algorithm; this is something you know, because it's the greedy policy according to your last neural network, and it's what you apply to the system. Q_pi_K is the true Q-function of this policy; this is something you don't know, something you cannot compute in general. But the bound says how far the policy is from the optimal one in terms of Q-functions. And you have three terms there. The last term is the rate of convergence; it's basically the rate of convergence of value iteration when you make no error. If you put the epsilons to zero, it's pure value iteration, you have only this term remaining, and it converges linearly.
But if you make errors, you will have the first two terms. The first term is something that I call the horizon factor. The horizon factor appears because, when you are in a gamma-discounted MDP, it's an infinite-horizon MDP with a gamma discount, and you can see it equivalently as: at the beginning of each interaction episode, I sample a random horizon according to some geometric law of expectation one over one minus gamma, I play for this random time, and then I stop. If you do this, it's purely equivalent to an infinite-horizon gamma-discounted MDP: you draw a random horizon, you play, and you consider the undiscounted reward along this episode. So that's why one over one minus gamma is called a horizon; it's the expected optimization horizon of your MDP. And you have a squared dependency on this horizon, and this is unimprovable: you cannot do better with value iteration, it's not a proof artifact. And you have the error term there, which basically says that you pay for a discounted sum of the norms of the errors, epsilon_j. And this is a bad result, in the following sense: imagine that epsilon_k is zero-mean and i.i.d. It's not the case, it's much more complicated, but if it were zero-mean i.i.d., you would expect the scheme to converge towards the optimal policy. And it's not the case at all. If it's zero-mean, the sum of the errors goes to zero by the law of large numbers, but here you don't have the sum of the errors; you have the sum of the norms of the errors, so it doesn't go to zero. The only way for this guy to vanish is that, as iterations proceed, the error goes to zero. It's the only way for the algorithm to converge in the end, and it's a very strong requirement. So this doesn't tell everything, because typically what I don't tell here is how you control this error term, how you make it small. That will depend on what the error term is, on how you do the dynamic programming scheme, how you generate samples, what kind of function approximation you consider, and so on. For DQN specifically, I don't have a finer bound than this. But it gives some idea of what you need to control if you want guaranteed good solutions: with value iteration, or with DQN, you need this guy to go to zero according to this analysis. And we'll see that regularization allows to improve on this. Schematically, the scheme and the bound we just discussed look as follows.
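This is a sketch; the exact constants and norms vary across references, but the structure matches the three terms discussed:

\[
\pi_{k+1} = \mathcal{G}(Q_k), \qquad Q_{k+1} = T_{\pi_{k+1}} Q_k + \epsilon_{k+1},
\]
\[
\|Q_* - Q_{\pi_K}\|_\infty \;\le\;
\underbrace{\frac{2}{(1-\gamma)^2}}_{\text{horizon}^2}\,
\underbrace{(1-\gamma)\sum_{j=1}^{K}\gamma^{K-j}\,\|\epsilon_j\|_\infty}_{\text{discounted sum of error norms}}
\;+\;
\underbrace{\frac{2\gamma^{K}}{1-\gamma}\,\|Q_* - Q_0\|_\infty}_{\text{linear rate of convergence}}.
\]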
So now we'll talk about regularized approximate dynamic programming. We'll see what it is in the next slides, but basically you want your policy, instead of being the pure greedy policy, to soften this greediness: you try to move towards the optimal policy while staying close either to some predefined policy or to the previous policy of the dynamic programming scheme. The question of why to do regularization in reinforcement learning appears, for example, when you see reinforcement learning as an inference problem, as probabilistic inference; this is a line of work that has close connections to this, but that I won't cover here. It's sometimes used for enhancing exploration: if you want to explore, you should try all actions enough, and one way to do that is to say, I want my policy to go towards the best action, but I want it to have a high enough entropy, to be stochastic enough, such that I continue to do some exploration. There are also some arguments about smoothing the optimization landscape, because when you add some entropy regularization, the thing is, you can think about it: if you take greedy policies, the operator that takes a Q-function and computes a greedy policy is not smooth; it's not Lipschitz. And when you do optimization, that is a big issue, generally speaking. But if you add a bit of entropy regularization, as we'll see later, it becomes Lipschitz, and this is really nice because the functions you're optimizing are better behaved. There are other reasons that we'll cover in the rest of the talk, but one recent reason for doing regularization is linked to large language models, which is a quite hot topic currently. You may have heard about reinforcement learning being used for fine-tuning large language models. In this case, the large language model is trained with a large amount of data and fine-tuned using supervised learning, and then you do reinforcement learning on a reward function that is, say, learned from human interaction, from human feedback. But in this case, your initial policy, the language model, is a good policy; you want to change it a bit. And the way to change it a bit is to say, okay, I will try to optimize this reward function, but I don't want my model to change too much. The language model is a policy, and you don't want this language model to change too much. So you add some regularization that forces the policy to not move too far from the initial one. And this will be covered here in an abstract way. We will focus on the viewpoint of regularized approximate dynamic programming because, well, it's the one I'm most familiar with, but also because it covers a lot of things and makes connections between things that may look disparate when you look at the literature. A part which is not covered, but complementary, is regularization in the linear programming formulation of MDPs. This is less classic, but it's important too. The thing is that the kind of algorithms you get, the kind of regularization you do, and the kind of quantities you regularize are pretty different. So you can have a look at references like relative entropy policy search; there are others, but these won't be covered here. So, I will introduce some notation to make things shorter. The first thing you want to do is compute the expectation of the Q-value according to a policy for a given state. It's a sum over actions of the policy times the Q-value for a given state, and I write this as pi dot Q, because component-wise it's a dot product; an expectation is a dot product, and I will make heavy use of this notation. This is a vector which is as big as the state space. P V is computing the expectation of the value function over the next state, given the current state-action couple, so it's an S-times-A vector. With this notation, you can write the Bellman operator as R, the reward, plus gamma times P applied to pi dot Q. And the greedy policy can be written as maximizing, component-wise, pi dot Q over all policies: when you search for a policy, it's a vector with positive elements summing to one, so you work in the simplex, and the argmax over actions is really the same as the argmax over policies. It may seem overly complicated; it's simpler to search for a single discrete action than to search over vectors in the simplex, but this will be useful afterward. A small numerical illustration of this notation follows.
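A tiny numpy illustration of the dot-product notation on random tabular objects; sizes and names are illustrative:

```python
import numpy as np

S, A = 4, 3
rng = np.random.default_rng(0)
Q = rng.normal(size=(S, A))                 # Q-function, an (S, A) table
pi = rng.dirichlet(np.ones(A), size=S)      # stochastic policy, rows in the simplex
P = rng.dirichlet(np.ones(S), size=(S, A))  # transition kernel, shape (S, A, S)
r, gamma = rng.normal(size=(S, A)), 0.99

pi_dot_Q = (pi * Q).sum(axis=1)             # <pi, Q>: expected Q under pi, an S-vector
T_pi_Q = r + gamma * P @ pi_dot_Q           # Bellman operator: r + gamma * P (pi . Q)
```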
And the entropy is written with this notation as minus pi dot log pi. The entropy says how diverse your policy is. The policy with maximum entropy is the uniform policy; it has entropy log of the number of actions. And if you consider a deterministic policy, it has entropy zero, which is the minimum possible entropy. The Kullback-Leibler divergence between two policies pi_1 and pi_2, which is not symmetric, is pi_1 dot (log pi_1 minus log pi_2). It is bounded below by zero; the only way to have a zero Kullback-Leibler divergence is that the policies are equal, and it's not bounded from above. With this notation, you can write the approximate value iteration scheme: you look for the policy pi_{k+1} as the maximizer over pi of pi dot Q_k, and you compute the next Q-function as the reward plus gamma times the transition kernel applied to pi_{k+1} dot Q_k, plus some error that you make because you use a neural network or whatever. And one maybe important thing is here: when you write things like that, you have the greedy step. The greedy step solves an optimization problem, namely, I want the maximizing argument of pi dot Q_k. And when you apply the Bellman operator, what you bootstrap is the solution, the maximum of this optimization problem, that is, pi_{k+1} dot Q_k. So you have a strong connection between the greedy step and what you propagate in the evaluation step. And what we'll do, basically, is change this guy. And what I'm saying is that if we change this guy, it makes sense to change the other one accordingly. So we'll see some examples of how to do regularization. What does it mean to do regularization in the greedy step? For now, we have this greedy step: I compute the greedy policy. And here what I show is the simplex. The simplex is the space of probability distributions over actions for a given state, so it's the policy space, and here I have three actions. The green dot is pi_0, an initial policy. If I compute pi_1, it will be greedy according to the Q-function we compute for pi_0, and this pi_1 is deterministic, so it will be on one of the corners of the simplex. Then I compute pi_2, and pi_2 will be there, and so on. You will compute a sequence of deterministic policies, so you will always be at the corners of the simplex. Now, if I add some entropy regularization: entropy is basically the Kullback-Leibler divergence towards the uniform policy, and you could replace the uniform policy by a base policy that you think is a good solution, something you don't want to deviate too much from. In this case, I compute pi_0, which is uniform here. I compute pi_1: the greedy policy would be there, but I want to maximize pi dot Q_k plus the entropy. So the solution of the optimization problem is not a deterministic policy but a stochastic greedy policy. The solution is pretty simple, I'll show it later, and you remain inside the simplex. Then you compute pi_2, and when you do so, pi_2 will be greedy according to the previous Q-value, while trying to remain close enough to the initial policy again. But you forget more or less what happened at pi_1, except through the Q-value. Next, you can also say that you want to remain close to the previous policy. The reason for doing this is that, basically, you can think of the Q-function as a kind of gradient direction.
And it is indeed a natural gradient of some objective function. But you can think of it as a gradient direction. And when you compute the greedy policy, the argmax over pi of pi dot Q_k, when you really compute the argmax over actions, what you do is follow this gradient direction with a huge learning rate. And this is, generally speaking, a bad idea in optimization: your function should be super well-behaved for a big learning step to work. And it's basically the same idea here. You will make errors, you will do approximation, so you cannot rely too much on your gradient direction, and you should take small steps. A way to take a small step in this case is to say: I have my current estimate of the policy, my gradient direction is Q_k, and I take a small step by following this gradient direction while trying to remain close enough to my current estimate, which is pi_k. And if you think about it, if you're not familiar with mirror descent, it's not a problem, but if you think about classic gradient descent, you can say that classic gradient descent is basically: I have a current estimate x_k, I have my function, I do a Taylor expansion of this function, I linearize it. And I optimize this linearized function; but it's linear, so you would go to plus or minus infinity. Instead you say: I want to optimize this linear function, but with some regularization that says the L2 distance between my new point and the old one should be small enough. And if you do this, it's exactly the vanilla gradient descent that you all know. Here, the idea is the same, but we work with policies, so it makes more sense to consider the Kullback-Leibler divergence rather than the L2 distance. But it's really the same idea. And so, I have my pi_0, I have pi_1; this is the same step as before because the initial policy is the uniform one. But for pi_2, I will say I go towards the greedy policy while remaining close to the previous one. And you can combine both things, by saying that you want to stay close at the same time to some initial good policy, some initial meaningful policy, and to the previous policy. And then you can replace the KL by something else, say any Bregman divergence; you can replace the entropy by something else, and so on. This can be generalized, but these two guys are the ones most commonly used. There's another way to do regularization, which is also linked to this idea of seeing the Q-value as a gradient, and it is to say: okay, I will not be greedy according to the last Q-function, but according to the sum of all past Q-functions. This is a bit less frequent, and the question is why one would do so. The intuition is the following one. Assume that you're not doing value iteration; you do something much simpler: Q_k is the optimal Q-function plus some noise. You can only observe the optimal Q-function up to some noise; you cannot observe it exactly. And this noise is i.i.d., zero-mean, so well-behaved. If you're greedy according to Q_k, there's no way your policy will converge to the optimal one, because you pay for the noise in each policy you compute. But now, if you sum the Q_j, what you get is the sum of Q_star plus the sum of the noises. And because of the law of large numbers, the sum of the noises will cancel out; the noise will go to zero, and you will get the optimal policy in the end. So that's why you may want to do this kind of regularization; the little simulation below illustrates it.
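A tiny simulation of this argument, with one state, three actions, and illustrative numbers:

```python
import numpy as np

# Suppose we only observe Q_k = Q_* + noise_k (i.i.d., zero mean).
# Being greedy w.r.t. the last Q_k keeps paying for the noise; being
# greedy w.r.t. the running sum (equivalently the average) recovers
# the optimal action, here action 0.
rng = np.random.default_rng(0)
q_star = np.array([1.0, 0.9, 0.5])
hits_last, hits_sum, h = 0, 0, np.zeros(3)
for k in range(1, 1001):
    q_k = q_star + rng.normal(scale=0.5, size=3)
    h += q_k                                 # sum of all past Q estimates
    hits_last += np.argmax(q_k) == 0
    hits_sum += np.argmax(h) == 0
print(hits_last / 1000, hits_sum / 1000)     # the summed version wins far more often
```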
And we'll see later that this is closely linked to what you do when you do Kullback-Leibler regularization. And next, the question is: should we regularize only the greedy step? As I said before, when you optimize something there, you propagate it here, so it may make sense to do both. So you can choose either to keep the evaluation step as it is, and this is a quite common approach in the literature, but it's often the wrong thing to do. Or you can remark that, given that this guy was initially what was propagated in the Bellman operator, if you modify it, you can modify the evaluation step too, and regularize it the same way. This is also often done in the literature, and I truly think this is the correct way to do things, at least from a theoretical viewpoint. And here we are: we have what I call mirror descent value iteration, because this step is really close to mirror descent, mirror descent being a generalization of, basically, gradient descent. It's not purely mirror descent because we don't really have a scalar function that we optimize; it's really a dynamic programming scheme. We do it for each state, and we cannot just use optimization tools and say, okay, under this assumption it's convex; there's no function, so it cannot be convex, and you have to do other things. But from an algorithmic viewpoint, it's super close to mirror descent, hence the name. And here I write omega; omega can be any convex function of the policy, and the negative entropy is such a convex function, but you could choose other convex functions; most of the results I will present apply readily. And yeah, we can do a case study, and then we'll do a little break. The first one is entropy regularization. I said, okay, I will add some entropy regularization, this is my regularized value iteration scheme. The question is how to get a practical algorithm from it, and then, what kind of theoretical result we may have. So this is value iteration with entropy regularization. And the way to derive a practical algorithm is to do exactly the same thing as we did for DQN: we look at what happens when we know everything, and then we put approximations where we need approximations, and we get a practical algorithm. So here, the greedy step is: you maximize over some space, here the simplex, pi dot Q plus the entropy. The entropy is a concave function, so it's minus a convex function. And this is something very well known in the optimization literature; it's called the Legendre-Fenchel transform, or sometimes the convex conjugate, of a given function. This is well studied: the maximum of this optimization problem is the convex conjugate, written omega_star of Q, where omega_star is the convex conjugate of omega. A useful property is that the maximizer of this optimization problem is the gradient of the convex conjugate itself; this is mostly useful for theoretical analysis. But if you take the negative entropy, the solution of this problem is well known: the policy is the softmax of the Q-function, and the maximum, the max over pi of pi dot Q plus the entropy, is the log-sum-exp. This is written with my dot-product notation, but it's the log of the sum over actions of the exponential of the Q-function, up to the temperature. But what is the softmax? The softmax is basically a soft version of the argmax over actions.
So instead of putting a weight of one on the best action, you put weights everywhere, and the better the action, the higher the weight. So it's a soft version of what we did before. And the log-sum-exp is a smooth version of the maximum. And so, having this analytical solution... yes, sorry? Are there cases where this update is exactly the same as the one without entropy? Like, for some initial policies, is it possible that the result of each update will be the same with the entropy as without? No, it's not possible, because if you take the iteration without entropy, you will compute deterministic policies. And if you do entropy regularization, the policies will be stochastic; they remain in the interior of the simplex, they cannot be deterministic, because a deterministic policy would make some term of the optimization program blow up. So you don't compute the same solution. If you consider entropy regularization, or generally speaking some omega function, you won't compute the optimal solution of the original MDP, even without errors. And one thing that I should really add to the slides, I'll do it for next time, is that instead of optimizing for the sum of rewards, you optimize for the sum of rewards minus omega of the policy in the current state. You change the dynamic programming, you change the reward function, you change the return with this regularization, and you compute different things. Okay, thanks. So, how to get a soft DQN algorithm from this? We know that with the entropy, the greedy policy is the softmax of Q_k, and I can just follow the same process as for DQN. I know that this is the entropy of the policy, so I just have to change the update rule and add the red term; the red term basically corresponds to this term, the log policy. The policy is not an explicit object; it's just the softmax of the Q-function. So I have an update that depends only on my Q-function, and I get a soft DQN just by changing the loss and keeping all the rest the same. And as the temperature goes to zero, we get back to DQN, in the sense that this guy will disappear and this guy, which is a soft max, will converge towards a hard max, and it will be the max over actions in the end. You can write it differently: we know that the max over pi, that is, pi_{k+1} dot Q_k plus the entropy of pi_{k+1}, is the log-sum-exp. So we can use this fact, and you directly get the loss function where, instead of having the expectation of the Q-function minus the log policy, you have the log-sum-exp. These are purely equivalent, mathematically. Numerically, it really depends on how you code the log-sum-exp. These are numerically unstable objects; there are well-known tricks to make them stable. When you code things, you use a function that computes the log-sum-exp, and you do not compose the log, the sum, and the exp functions, otherwise it won't work; see the sketch below. And again, this guy converges towards the maximum as the temperature goes to zero. It's not obvious from there, but I state it.
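As an aside on that numerical point, here is a stable version, written from the standard max-subtraction trick:

```python
import numpy as np

def log_sum_exp(q, tau):
    """Numerically stable tau * log sum_a exp(q_a / tau).

    Naively composing log, sum, and exp overflows as soon as q / tau is
    large; subtracting the max first avoids this.
    """
    m = q.max(axis=-1, keepdims=True)
    s = np.exp((q - m) / tau).sum(axis=-1, keepdims=True)
    return (m + tau * np.log(s)).squeeze(-1)

def softmax_policy(q, tau):
    """Soft greedy policy: softmax of Q at temperature tau."""
    z = (q - q.max(axis=-1, keepdims=True)) / tau
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```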
And next, the question is what to do with continuous actions. DQN is a pure critic approach: you don't have a policy object, because you can compute the policy as an argmax over actions, or a softmax over actions. But with continuous actions, the policy cannot be computed, because you cannot compute the softmax over an infinite number of actions. And the idea is to learn it. This is the basis of actor-critic algorithms. One point here: many actor-critic algorithms are presented in the literature as coming from the policy gradient theorem. But if you look at what they do algorithmically, like SAC, TD3, DDPG, and so on, they are basically value iteration approaches. So they are still covered by this kind of approach: even if they are motivated from policy gradients, the final algorithm is closer to value iteration. Anyway, the first solution is to take a direct approach. My greedy step is pi dot Q plus the entropy. Here, pi dot Q is the expectation over actions, and the entropy is pi dot log pi, up to the sign. So the expectation over actions of Q minus tau log pi is really the optimization program I want to solve. I know that, analytically, the solution of this optimization problem is the policy being the softmax of my Q. I know it, but I cannot compute it. So I will use a neural network for pi, take an expectation over the states I have in my dataset, and I have a loss function, and I will optimize it. And that's it. The issue is that optimizing over this policy is a bit cumbersome, because with automatic differentiation as-is, it won't work. So what you can do, and these are more optimization tricks for being able to compute the gradient: you can use importance sampling, saying that the expectation under pi_omega, the policy you're optimizing over, is the expectation under another policy times the ratio of the policy you're interested in over the other policy. And then you can use the REINFORCE trick, which is quite common in policy gradient approaches. In practice, people don't do this; they use the reparametrization trick, which is something different. For this, you can check the literature, but the important point is that the loss function for learning the policy comes directly from the greedy step of value iteration. Then, whether this loss function is easy to optimize or not, you may have to use some optimization trick to be able to optimize it. But the way to write this loss function is really: take what you do in your DP scheme, then put approximations where you need them, and you get your loss function. A second solution, which is how this is presented in the literature, is to say: I know that my optimal policy is the softmax of the Q-function, and I will minimize some distance between my neural network and what I know to be the optimal policy. Here it's a reverse Kullback-Leibler divergence, and in the end, it's exactly the same loss function. So I think that seeing this as: you just take your greedy step in value iteration, you put approximations, you get your loss function, is much more direct and much more meaningful. It's also much more interesting, in the sense that for the entropy you have an analytical solution to the Legendre-Fenchel transform, but in many cases you don't have the analytical solution, so you would have to rely on a loss function even with discrete actions. And for the evaluation step, it's exactly the same as for DQN: you just do the same thing, you add the entropy, and you're done. A sketch of this greedy-step loss with the reparametrization trick follows.
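A minimal sketch of this greedy-step loss; `policy` is assumed to return a torch distribution supporting rsample() (for example a squashed Gaussian), and all names are illustrative:

```python
import torch

def policy_loss(policy, q_net, states, tau=0.2):
    """Greedy-step loss for continuous actions, via the reparametrization
    trick: maximize E_pi[Q(s, a) - tau * log pi(a|s)] over the policy network.
    """
    dist = policy(states)
    a = dist.rsample()          # reparametrized sample: gradients flow through a
    log_pi = dist.log_prob(a)
    q = q_net(states, a)
    # Minimizing this maximizes pi . Q + tau * entropy, state-wise.
    return (tau * log_pi - q).mean()
```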
And this is basically the soft actor-critic algorithm that I just described. It was not presented this way, but it's value iteration. And last slide, I guess, before a short break: the question is what we compute in the end. This is a good question: what do we compute in the end, and with what guarantee? We can do the propagation of errors much like we did before, and the bound is this one. We bound the difference between the optimal Q-function and the Q-function of the policy computed by the algorithm, but it's Q_star_tau now. And this is a biased solution, in the sense that Q_star_tau is not Q_star. As I said before, Q_star is the optimal Q-function for the sum of rewards; Q_star_tau is the optimal Q-function for the sum of rewards minus the regularization term on the policy. And this might not be a problem. It's biased, but maybe this bias is what you want. If you take the example of the large language models I talked about before, what you want is to remain close, so the solution you want is the Q_tau one, not the Q one. But if you add some regularization for other reasons, for example for exploration, if you add some entropy to have better exploration, you should know that in this case what you compute in the end is biased. And the bound on the right-hand side is the same as for DQN. So this may suggest that it's not interesting to do this, which would be wrong. It may be interesting because the regularization term is something you really want to take into account: you have some prior knowledge which is important, and you want to incorporate it into your problem. Or because this is only one specific analysis of this kind of thing, which overlooks a lot of other aspects that you will have in reinforcement learning. For example, it doesn't say anything about the control aspects, about the exploration-exploitation dilemma, and it doesn't say anything about how easy it is to optimize the regularized objective. And typically, when you take policy gradient approaches, you can also add this kind of regularization, and you will have much better convergence rates just because of this added regularization. So there are other reasons and theoretical analyses suggesting that having this regularization is a good thing, but from the purely approximate value iteration viewpoint, we don't gain anything. We don't lose anything either, which is also a good thing. So I guess we can do a short break, maybe a five-minute break, before continuing. Okay, so now, Kullback-Leibler regularization. We have the same scheme as before; I replace the entropy by the Kullback-Leibler regularization towards the previous policy. And same as before, we'll do the same approach; we'll maybe skip continuous actions. So in this case, the greedy step is again a Legendre-Fenchel transform, it's always the case. And, it's not that obvious, but I'll state it: the greedy policy is proportional to the old policy times a softmax; it's a softmax weighted by the previous policy. And with a direct induction argument, given that pi_k is itself proportional to pi_{k-1} times the exponential of the previous Q, and so on, you can show that this is the softmax of the sum of the past Q-functions. And this is the link I mentioned before: regularizing towards the previous policy with a Kullback-Leibler divergence, or being greedy with respect to the sum of Q-functions, are basically the same thing. The little derivation below makes the induction explicit.
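Writing lambda for the weight of the KL term, the induction goes schematically as follows:

\[
\pi_{k+1} \;\propto\; \pi_k \, e^{Q_k/\lambda}
\;\propto\; \pi_{k-1} \, e^{(Q_{k-1} + Q_k)/\lambda}
\;\propto\;\cdots\;\propto\; \pi_0 \, e^{\frac{1}{\lambda}\sum_{j=0}^{k} Q_j},
\]

so with a uniform pi_0, pi_{k+1} is exactly the softmax of the scaled sum of past Q-functions.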
What the Kullback-Leibler divergence does in this context is that it implicitly averages the past Q_k you've seen. And what you can get is an equivalent approximate value iteration scheme that I call dual averaging; you have the same kind of duality between mirror descent and dual averaging as in optimization. Basically, I know that the greedy policy will be the softmax of the sum of the Q_j, and I know that the softmax is itself related to the Legendre-Fenchel transform of the entropy. So here I say that it's the maximizer of pi dot H_k, H_k being the average of the Q-values, plus the entropy with some temperature, because I know that the solution of this optimization problem is exactly this guy. So it's just a way to rewrite it. And H_{k+1} is the sum, really the average, of the Q-values, and I have the evaluation step, which remains the same. So I add something new, the H, but it will be useful later. Yeah? I have some confusion about how we use the optimization. I don't know if there is an analogy between this and having a prior, where we sample tasks, then we execute a task by conditioning the policy on it and getting a trajectory, and then it's like trying to minimize the bits of information that we cannot actually predict, in order to make the task predictable. Is it related somehow to what we're doing here, or is it different? So the question is: can we see this as having a kind of prior over policies? Yes, you can see it like that. I'm not sure; you may have this connection in the literature, more in the literature on control as probabilistic inference, because there you work more with Bayesian objects. But here you can really see the policy towards which you regularize as a kind of prior that you don't want to move too far from, but that you still want to change. You can interpret things like that. And this prior, in the previous case of entropy regularization, is just the uniform one, and it's kept fixed, while here it's more like you solve a succession of problems where you change your prior at each iteration. One alternative would be to say: okay, I have some policy, I take the Kullback-Leibler divergence with respect to this policy, and I solve the MDP until the end; I get an optimal policy, and this will be my next prior, and then I use it for a new MDP, and so on. And here, what we do is optimistic, in the sense that we change the prior at each iteration of the algorithm, which is more efficient because you need less compute, fewer samples, and so on. So, a practical algorithm from this is this one. The issue now is that if I have a Q-network, I can compute the softmax of the Q-network, but weighting it by the previous policy, which is itself a softmax of the previous network, is not easy. So if you take things in this form, you don't really have a choice: you have to introduce a policy network, because you cannot compute this guy. If the parametrization were linear, it would amount to averaging the parameters, and that works, but with neural networks it's not the case. So if you want to compute it in this form, if you have the previous policy you could do it, but this previous policy itself has to be represented. If it's implicit, it's this guy.
We'll see later a smarter way to do this, but for now, the way to do it is to say, as before: okay, I have the greedy step; I wrote it as an expectation, it's just a dot product written as an equivalent expectation. I put an expectation over the dataset I have, and I get a loss function, and then I try to optimize this loss function. I have to introduce some target policy network, I have my current policy network, and I optimize over this current policy network. And that's it, I have a loss function. It's not the best possible approach. Another approach would be to say: okay, I know that the policy should be the softmax of the sum of Q-functions, so let's try to learn the sum of Q-functions using a neural network. By the way, here I say that I try to fit H to one minus alpha times the previous H plus alpha times the Q-value, which is a moving average instead of an average of the Q-functions, and this also works in practice. So this approach, we tried it, it works. It is called momentum DQN, because it was introduced from a different perspective, but it's really this in the end. And for the evaluation step, as before, you just take the exact equation, you put approximations everywhere, and you're done. And the question is what we gain theoretically, because it's a bit more complicated. One thing also is that I think the canonical examples of this are trust region policy optimization and proximal policy optimization, which you will maybe cover this afternoon. What they do is use this idea of a trust region, as they call it; it's really the same idea, I don't want to move too far from my current policy, and they do it using different approaches. But the core idea is the same, and this will be covered this afternoon. And so, what do we gain? This is the bound we had before: it's approximate value iteration, and we had the discounted sum of the norms of the errors. If we look at the bound we get with this kind of regularization towards the previous policy, we get that instead of having a weighted sum of the norms of the errors, we have the norm of the average of the errors. And it means that if your errors are well-behaved, in the sense that they are zero-mean i.i.d. over iterations, this guy will go to zero. Exchanging the sum and the norm is super important: they do not commute, and it makes a very big difference. And here you have an empirical illustration on just a tabular MDP. It's a very simple setting, but this is approximate value iteration that does not converge, that cannot converge because the errors never go to zero, and with Kullback-Leibler regularization, you do converge. So you can compensate for the errors along iterations. Another thing is that here you have a squared dependency on the horizon, while here you only have a linear dependency on the horizon, which is also a very strong improvement. And you lose something: now the convergence is sublinear. But it makes sense, because value iteration has linear convergence, and how did we modify value iteration? By saying that instead of doing the big update, we do a much smaller update. If you do a much smaller update and you have no errors, in the end it will be slower. You have to pay for it; there's no free lunch. Schematically, the two bounds compare as follows.
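Heavily simplified (the exact statements in the papers carry extra terms and constants, and the second bound hides horizon-dependent factors in the O(1/K) term):

\[
\text{AVI:}\quad \|Q_* - Q_{\pi_K}\|_\infty \;\lesssim\; \frac{1}{(1-\gamma)^2}\,\max_j \|\epsilon_j\|_\infty,
\qquad
\text{KL-regularized:}\quad \|Q_* - Q_{\pi_K}\|_\infty \;\lesssim\; \frac{1}{1-\gamma}\,\Big\|\frac{1}{K}\sum_{j=1}^{K}\epsilon_j\Big\|_\infty + O\!\Big(\frac{1}{K}\Big).
\]

Note the norm of the average versus the sum of norms, and the linear versus squared horizon factor.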
And, yeah. Just going back a bit, to the last slide: on the very left-hand side, should this be Q_star_tau there? No, no, in this case, no. That's an excellent question. In this case, I don't have entropy regularization, so I don't regularize towards something which is fixed, but towards my previous iterate within the algorithm. And in this case, what I compute in the end is the optimal solution of the original MDP; I don't have a bias as before. But I could add some entropy regularization and get this bias back. Thank you. Changing the bounds too. But yeah, that's an excellent point: in this case it's not obvious that you compute the right quantity in the end, but it is the case. So, a quick overview. I will go quickly over this, but this is a quite non-exhaustive list of papers that deal with some form of regularization. Basically, you can regularize the evaluation step or not; as I said before, you always regularize the greedy step, but the evaluation step you can regularize or not. You can put entropy, Kullback-Leibler, or both. And then you have a bunch of algorithms, depending also on many other aspects. One thing is that part of these algorithms were not introduced with a regularization viewpoint. In the end, you can derive the very same algorithm using the regularization perspective, but it was not the case when the algorithm was written. For example, momentum DQN was introduced, from this gradient view of Q-functions, as a way to do something like momentum in gradient optimization by summing the Q-functions, and the link to regularization was made afterward. So there are a bunch of works that use this kind of idea, and many more now in practice, with LLMs again. I think I will skip this part, but basically you can get analyses of what happens depending on whether you put entropy, KL, or both, in the evaluation step, in the greedy step, or both. And this, basically, is known as softmax DQN: you put the entropy there, but you don't put the entropy there, and this may not converge, even without errors. This is a known failure case, in the sense that you have a kind of implicit Bellman operator, and this Bellman operator is not a contraction. You can build examples where this Bellman operator has more than one fixed point; the fixed point is not unique, and so on. So the TL;DR here is: if you change this guy, change the other one too. It will be much simpler, and in practice it may not change much, but it's nicer, I would say. So I'll skip a bunch of slides; basically you have bounds that resemble quite a lot the ones I've shown before. And now the question is the following. Kullback-Leibler regularization is nice because it implicitly averages the Q-functions, and by doing so it implicitly averages the errors, and you can converge when value iteration cannot; that's cool. But I want to do deep reinforcement learning, and in deep reinforcement learning we have this equivalence; these are the two schemes I have shown before. If you take the first one, as I said before, you can write the dot product as an expectation, put an expectation over the states of your dataset, define a loss function, and optimize it; but if you do this, you will make an error there, and my analysis, my nice bound, does not account for having an error there.
If you add an error here, you lose the averaging of the errors. And you can say, okay, I'll take the other form then; but if you take that form, you have errors in this approximation, because you cannot just sum things, except if you remember all the past Q-networks, and this quickly becomes super expensive; you want to avoid keeping a memory of hundreds of networks. So the two schemes are no longer equivalent, my analysis does not hold, and then why do this in deep reinforcement learning? The analysis came because there were super efficient approaches doing this, especially in the policy gradient area, with things like TRPO. So it works; the question is more how to make the algorithm closer to the analysis, such that maybe it works even better. And also, this thing is not really compatible with stochastic approximation: basically, if you don't replace this by a moving average, it might not work well. And the remedy is the Munchausen trick. So Munchausen reinforcement learning is a paper, but it is basically how to do Kullback-Leibler regularization in practice, with the theoretical guarantees, within a deep reinforcement learning context with discrete actions. But the trick is more widely useful and, I think, useful to know. So we can start from this, and recall that the Kullback-Leibler divergence between pi and pi_k is pi dot (log pi minus log pi_k), and you can group the log pi_k with this guy, and so on. You can check the paper, I have not done the derivation here, but it's pretty simple. What you do is the following. I write a Q_prime_k here, and Q_prime_k is really the Q-function I was talking about before, the same as Q_k before. And then I just do a rewriting trick: I define Q_prime_k plus alpha tau log pi_k as Q_k, I write everything in terms of Q_k, and it works. And in the end, this is the scheme I have. If you take everything but the red term, it's soft DQN, value iteration with entropy regularization, and here you have the Munchausen term. And that's it. The idea is that Q_prime is homogeneous to a Q-function, while this Q_k is homogeneous to a Q-function plus the log policy; and the log policy, in essence, as we'll see later, is an advantage function. So instead of learning the Q-function and then trying to learn an approximation of the sum of Q-functions, I have a clever way to learn directly the sum, because in a sense Q_k, up to a state baseline that I don't care about, is the sum of the past Q_prime_j. So this is a clever trick to learn directly the sum of a quantity instead of the quantity itself, and it's useful in other contexts too. That's what Munchausen does. And so you can modify the loss function as follows. This is the DQN loss as before; I add some entropy regularization as before, the blue term; I add the Munchausen term, the red term; and that's it, it's just a tiny change of the loss function in the end. A sketch is below. What is super important here is that the log policy has a different sign: with a minus, it's entropy; with a plus, it's not entropy. So I'm not just writing an entropy term elsewhere; it's pretty different.
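A sketch of this modified loss; the hyperparameter values are illustrative, the clipping of the log policy is a standard stabilization, and this is not the paper's exact code:

```python
import torch
import torch.nn.functional as F

def munchausen_dqn_loss(q_net, target_net, batch, gamma=0.99,
                        tau=0.03, alpha=0.9, clip=-1.0):
    """Munchausen-DQN loss: the soft-DQN target plus the red
    alpha * tau * log pi(a|s) term."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # The policy is implicit: the softmax of the target Q at temperature tau.
        log_pi = F.log_softmax(target_net(s) / tau, dim=1)
        log_pi_next = F.log_softmax(target_net(s_next) / tau, dim=1)
        pi_next = log_pi_next.exp()
        # Munchausen term: alpha * tau * log pi(a|s), clipped for stability.
        m_term = alpha * torch.clamp(
            tau * log_pi.gather(1, a.unsqueeze(1)).squeeze(1), min=clip)
        # Soft bootstrap: expectation under pi of (Q - tau * log pi) at s'.
        soft_next = (pi_next * (target_net(s_next) - tau * log_pi_next)).sum(dim=1)
        target = r + m_term + gamma * (1 - done) * soft_next
    return F.mse_loss(q_sa, target)
```

With alpha = 0 this reduces to the soft-DQN loss, and letting the temperature also go to zero recovers DQN.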
One way to think about it is the following: when you do DQN, what you say is, if I knew the optimal Q-function, Q_star, I would apply this update and get the optimal Q-function; but I don't know the optimal Q-function, so I replace it by my best current estimate, which is my target network. Now take the Munchausen term: if I knew the optimal policy, then alpha times the log of the optimal policy, say the optimal policy is deterministic, would be zero for the optimal action and minus infinity for suboptimal actions. So just bootstrapping this term would give me the optimal Q-function, or something which has the same best action, because I would set everything which is not optimal to minus infinity. So you can see this as a way to also bootstrap the information you have in the policy. And it works well: this is DQN, and this is Munchausen DQN, obtained just by changing the loss a bit, by adding the alpha log policy. So it's a small change algorithmically, but it makes a big change empirically. And it works thanks to the fact that Kullback-Leibler regularization is doing error averaging over iterations under the hood. And yeah, it's an improvement across the Atari games. You can check the paper, but the nice thing is that there are other improvements of DQN, for example IQN, which stands for implicit quantile networks. IQN is a distributional reinforcement learning approach, I don't know if you had a course about it, but it's quite orthogonal to what I've talked about so far. And you can add the Munchausen term to this IQN; it's the same thing basically, you add the alpha log pi, and it works better. So why does it work? Another viewpoint is the following one. We add the alpha log policy term to the basic soft DQN. Soft DQN, from my own experiments at least, doesn't really work better than DQN: if you really tune the entropy for each problem you consider, you can get a bit better results, but there's no consistent improvement of soft DQN over DQN, at least on the Atari games I just showed. So what changed is really the log policy term. And if you look at the log policy term, the policy itself is the softmax of the Q, but this Q is now homogeneous to a sum of Q-functions, not to a Q-function. And the log policy is Q minus the log-sum-exp of Q, just by a basic computation. And in the end, if you take the limit as the temperature goes to zero, you get the reward plus alpha times the Q-value minus the max over actions of the Q-value in the current state. This is the advantage function, and this scheme is known as advantage learning. Advantage learning is something that we knew could work well, but so far it had not really been analyzed; we did not really understand why it was working. There were some analyses, some arguments for why it works, for example based on some consistency properties. But what we saw is that advantage learning is, in the limit, doing a form of Kullback-Leibler regularization. And this connection is not super well known in the literature, but I think it's super important. And the thing is that the advantage function defines the notion of action gap. The action gap is: say I have the optimal Q-function, Q_star, and I take the difference between the best action and the second-best action. If this difference is super small, then it will be hard to learn the best action, because if I have more noise than the difference between the two actions, it will be super hard to learn.
And the thing is that if you look at the advantage function, it relates to the action gap. The action gap is the following: say I have the optimal Q-function Q*, and I take the difference between the value of the best action and the value of the second-best action. If this difference is super small, then it will be hard to learn the best action, because if I have more noise than the difference between the two actions, learning the right one will be super hard. And what you can show (it's done in the paper) is that without Kullback-Leibler regularization you have some action gap, and if you add Kullback-Leibler regularization, you get a gap dilated by a factor of one over one minus alpha, where alpha is the weight between the Kullback-Leibler and entropy regularization. Typically we take a practical alpha of 0.9, which gives a times-10 increase in the gap. You don't always observe this factor of 10 empirically, but it works better, and we do observe an increased action gap in practice. And basically, an increased action gap means that you can more easily learn your Q-network.

I guess I have five minutes, so I'll go super quick on the last things; there are a lot of related topics. One is a piece of theoretical analysis that asks the following: if I have access to a generative model, that is, a simulator I can call any number of times for whatever state-action couple I want, what is the minimal number of calls, how many samples do I need in this very simplified setting, to learn an epsilon-optimal policy? This is called the sample complexity. And basically this slide says that we have shown that doing Kullback-Leibler regularization is minimax optimal; that is, no algorithm will need fewer samples than this. There exist other approaches that are minimax optimal, but they are either model-based, or model-free but much more complicated.

So Munchausen is super nice, but the issue is that it works only for discrete actions, because you still need to compute the softmax. And this is something which I think is also interesting: if you want your policy to be exactly the softmax of your Q-function with continuous actions, you would think it's not possible, but yes it is. The way to do it is to define your Q-function implicitly as tau log pi, where pi is an explicit policy network, plus V(s), where V is an explicit value network. So instead of having a Q-function and a policy, you have a policy and a value function, and you define your Q-function implicitly as the log-policy plus the value. If you do so, it is immediate that the softmax of the implicit Q is the explicit pi. So you have an explicit policy, you can plug it in and get a loss. But this may not work super well, because in continuous control you often take a Gaussian policy, since it's simple, which means that this guy is Gaussian, which means that you basically force your Q-function to be quadratic in the actions. And this may be an object which is not rich enough to represent the Q-function, or to bootstrap the Bellman operator. So you have an exact softmax, but you have moved the problem elsewhere, onto the representation power of the policy.
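Written out, the implicit construction just described is the following identity (an editor's reconstruction); the point is that the normalization factor exp(v(s)/tau) cancels, so no intractable integral over actions is ever needed:

```latex
q(s,a) := \tau \log \pi(a \mid s) + v(s)
\;\Longrightarrow\;
\frac{e^{q(s,a)/\tau}}{\int e^{q(s,a')/\tau}\, da'}
= \frac{\pi(a \mid s)\, e^{v(s)/\tau}}{e^{v(s)/\tau} \int \pi(a' \mid s)\, da'}
= \pi(a \mid s).
```

The identity holds for any policy class; the Gaussian caveat above is about the expressiveness this forces on q, not about the identity itself.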
You can also use this kind of idea for exploration. What we have seen so far is a Kullback-Leibler term that says: I want to stay close to the previous policy. But you can change the sign and say: I want to deviate from this policy, and deviating from it performs some sort of exploration. This works pretty well in some settings.

And so on: I don't know if there's a lesson on offline reinforcement learning; it's quite a hot topic currently. Basically, you want to learn an optimal Q-function, but you don't have access to the simulator, you just have access to data. This is really super hard, and what many approaches do is regularize towards the behavior policy which is in the data. So you have, more or less explicitly, a lot of regularization ideas in this setting. In imitation learning or inverse reinforcement learning, this regularization is important, mostly for technical purposes. And for game theory or multi-agent reinforcement learning (again, I don't know if there's a lesson on this), whatever kind of approach you take, you want to learn to play against some opponent, and you have to make a lot of assumptions about the game and so on. But in the end, you have to play not against the last strategy of your opponent, but against some kind of average of what your opponent did in the past; otherwise it won't converge. So when you do games, you have to average somehow, more or less explicitly, and you can use this kind of trick, such as the Munchausen trick, to learn the sum of something implicitly. And I guess that's it. Thank you for your attention. Question or break? Questions.

Yeah, so I'm a little curious about the sub-optimality comparison between the KL-regularized and the vanilla value iteration. What you showed in your graph was that there is some dependence of the convergence bound on lambda, right, the KL-regularization, learning-rate-type parameter. But the bound has no dependence on this. And furthermore, it has a dependence on tau which looks monotonic; there's no trade-off between different kinds of terms. I mean, it looks like a regularization-type term that you might see in a regret bound. I was just wondering if you could make a few comments about that. It looks like the sub-optimality bound we're seeing doesn't really capture all of what you see in practice.

Yeah. So, when you have only lambda, basically the higher the lambda, the slower the convergence; you have these two terms in the bound. Having a higher lambda makes the convergence slower, but it also makes the averaging more efficient. And this is when you have only Kullback-Leibler regularization. If you have both, you don't have a pure average; what you get is a kind of moving average of the errors. It's not sufficient for canceling out the noise, but it's sufficient for reducing the variance of the sum of the noise.
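As a toy numerical illustration of that last point (an editor's sketch, not from the talk): a moving average of i.i.d. errors does not cancel the noise, but it shrinks its variance a lot. The coefficient beta below stands in for the geometric weight induced by the lambda/tau trade-off; its exact form is an assumption here.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.9                       # averaging weight; stands in for the lambda/tau trade-off
eps = rng.normal(size=100_000)   # i.i.d. per-iteration errors, variance 1

ema = np.empty_like(eps)
acc = 0.0
for k, e in enumerate(eps):
    acc = beta * acc + (1.0 - beta) * e   # moving average of the errors
    ema[k] = acc

# Stationary variance of the moving average is (1 - beta) / (1 + beta) ~ 0.053,
# far below the variance 1 of any single error term.
print(np.var(eps[1000:]), np.var(ema[1000:]))
```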
Yeah, but then you might expect to see some partial dependence on the infinity norm of the step-wise errors, right? But you don't see that at all in this bound; there seems to be a discontinuity.

Okay, so the thing also is that the errors, the epsilons, will depend on the kind of regularization I do, because they come from fitting this regularized term. So there is a hidden dependency of the errors on the regularization, and in general it's super hard to say what the influence of the regularization will be on these errors. But take this paper, the one showing minimax optimality. There we have a quite specific setting: we have access to a generative model, and we do this kind of scheme, where at each iteration, for each state-action couple, we draw M samples of the next state, and then we do the MDVI step. And then we ask how to tune the KL temperature lambda, the entropy temperature tau, and the number of calls to the generative model, such that in the end you are minimax optimal. And we optimize over these bounds. But you have to make the error explicit if you want to get this kind of result. Empirically, if you take Munchausen-DQN, you have alpha and tau. Tau is a pure temperature somehow, how much bias you want to put in the MDP; and alpha is the ratio between the Kullback-Leibler term and the entropy term. And basically, the TL;DR is: take alpha equal to 0.9, it works.

Okay, thank you.

Maybe we can stop here and take the remaining questions offline, so you can ask Mathieu directly, and go for the coffee break. Yeah, okay. Thank you, Mathieu. Thank you.