Welcome back, everyone. The next lecture is going to be by Gergely Neu, who works predominantly on online optimization, bandits, and theoretical reinforcement learning. He also has an ERC grant which he uses to support a lot of us volunteers in our current research. Give him a warm welcome.

All right, cool. So this is the part where I should be thanking the organizers for inviting me, but that's kind of awkward in this case because I'm supposed to be part of the organization team, and I always find it sort of sketchy and awkward when organizers invite themselves to give a talk. But I guess in this case it's not a big problem, because I didn't do any of the work of the organization. I don't know, maybe I'm just making it worse; I should stop this monologue right here. But yeah, I guess all credit goes to Matteo and Vincent for leading the effort in organizing the school, and also to the team of volunteers, which includes all my students and all the other students in our group. They're wonderful, and maybe this is the right opportunity to thank them a little bit, because we're going to have another opportunity to thank them again on Wednesday, at this mysterious concert that's going to happen.

Right, but let's talk about reinforcement learning, because that's why you're here, and that's why we are here. So far, in all of these lectures at this summer school, I think literally all of them, we've been looking at reinforcement learning methods that were derived from the perspective of dynamic programming: value functions, optimal value functions, Bellman equations, and so on. And what I'm going to be doing today is... some kind of feedback there; maybe your mic is on. That's what they call a professional musician, right? Okay, okay, nice, good. So what I plan to do in this lecture is give you a bit of an introduction to an alternative framework for developing reinforcement learning algorithms, an alternative framework for studying sequential decision-making and optimal control problems. This may be a little bit unusual for some of you, so I'll try to go a little bit slowly, with not so much focus on the most recent and fancy results but rather on the foundations and the basics, trying to give you some basic understanding of where this framework comes from and how to derive some algorithms from it. So let's see if this kind of funky setup that I set up for myself here works.

Right, so this is the rough outline of this talk. I'm first going to give you, I suppose, the fastest introduction to Markov decision processes, because you have seen this like a million times by this point already. Then I'm going to be talking about this funky linear programming framework for Markov decision processes, and I'm going to show you how to derive some algorithms from it. In particular, I'm going to be talking about so-called primal-dual methods for solving Markov decision processes, and if there's time at the end, I'm going to show you some relatively recent results that we have developed for linear function approximation in this setting. But I suppose that material gets quite a bit technical, so most likely I will basically focus on the first two bits here. All right, okay. So, our favorite three words together: Markov decision process.
I suppose this slide is really just there to fix my notation. An MDP models a sequential decision-making problem in which a learning agent observes the state X_t of the environment. Importantly, I use X_t to denote states; this is notation that comes from the optimal control literature. So in each round we observe the state X_t of the environment, we take an action A_t taking into account this state and all the history we have seen before, and as a result we obtain a reward that is some function of the state and the action, and the environment gives us a new state generated according to this Markovian rule, which means that the distribution of the next state is only a function of the current state and the action that was taken. The goal of the agent is then to figure out a way of picking actions, selecting a sequence of actions, such that some numerical objective that captures long-term rewards is maximized. The specific setup that I'm going to be focusing on in this talk is the discounted return, which is sort of the most popular one studied in the literature, and the one that has been studied most during this summer school as well. Precisely, we're going to be looking at the infinite sum of rewards, written here in red on red, so that's very efficient; but what we're going to try to optimize is the sum of discounted rewards, discounted by this discount factor gamma. I don't think I need to explain that too much.

There are some basic facts that have been established several times over the week, which I'm going to recap real quick. You can find all of them in any of the famous books: the Sutton and Barto reinforcement learning book, the MDP book by Martin Puterman, or the dynamic programming books by Dimitri Bertsekas. Since we know that the next-state distribution is only a function of the current state and the current action, and we have a stationarity property as well, meaning the transition function does not depend on time, we know that it is enough to consider stationary policies, which only look at the current state, do not remember anything about the history, and produce a probability distribution over actions. So we are going to be happy with this: we're going to be deriving algorithms that produce such stationary, potentially randomized policies, and we're going to be using this to develop our theory. Of course, there are many other beautiful properties that are useful for deriving algorithms, for example the existence of a deterministic optimal policy, and also the guarantee that there exists an optimal policy that is optimal no matter what starting state distribution we initialize the process from. But in our model we are going to attach some importance to the initial distribution of states as well. I think this is something that should be emphasized for the purpose of this talk: this part right here, which says that X_0 is drawn from some fixed distribution nu_0 at the beginning of the process, and then all consecutive states are generated by the Markov decision process. In the framework that I'm going to be talking about, this is going to have key importance; for dynamic programming it is not necessarily so important, but we're going to see that for linear programming it plays a curious and nice role.
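Since the slide itself isn't reproduced in this transcript, here is the setup written out in symbols, just my reading of the notation fixed above, nothing beyond what was said:

```latex
% Discounted MDP: dynamics and objective
X_0 \sim \nu_0, \qquad A_t \sim \pi(\cdot \mid X_t), \qquad X_{t+1} \sim P(\cdot \mid X_t, A_t),
\qquad
\rho_\gamma(\pi) \;=\; \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t\, r(X_t, A_t)\right].
```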
All right. So pretty much all the algorithms that we've been looking at during the week are based, in one way or another, on the Bellman optimality equations, a nonlinear system of equations that characterizes something called the optimal action-value function. They establish that the optimal action-value function equals the immediate reward plus gamma times the optimal value of this action-value function in the next state; so the value function decomposes into these two bits. What's nice about this optimal action-value function is that it directly encodes a policy: just by being able to solve this system of equations and find a solution, I immediately obtain an optimal policy that maximizes my long-term rewards. In particular, I know that a greedy policy with respect to this Q* is an optimal policy. And, as famously shown by Richard Bellman in the 1950s, and by others at the same time in the context of optimal control and game theory, a solution to this system of equations can be found using a recursive procedure called dynamic programming.

So we've been looking at all sorts of methods derived from this perspective, including TD methods and approximate dynamic programming methods like DQN, Q-learning, and so on; all sorts of approximate value iteration and approximate dynamic programming methods have been developed using this framework as a starting point, facing the two key challenges of reinforcement learning. The first is that the expectation over the next state x' that appears in this little formula can typically not be evaluated exactly in real-world applications, so we need to figure out a way of replacing this expectation with some approximate version of it, approximating the next-state distribution in some clever way. The other one, of course, is that if the state space or the action space is very large, then there is simply no hope of finding an exact solution to this system, and we need to rely on approximations there as well. So, morally, what happens in this framework is that we start from a set of equations derived from a methodology developed for a fully known system, with fully known transition function and reward function, and then we work from this as a starting point and use it as an inspiration for deriving algorithms.

We're going to take the same steps for an interesting alternative framework for sequential decision-making. Our goal is to study an alternative theory that is parallel to this dynamic programming view and to make the same steps within it: we're going to characterize the optimal solution in the framework we develop, which is going to be the linear programming framework, and we're going to use this as a sort of spiritual foundation for developing algorithms that can deal with samples, with function approximation, and with large-scale state-action spaces.

All right. So this is the framework of linear programming for Markov decision processes, and it is based on the following little calculation, the following little observation.
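For reference, the Bellman optimality equation I keep referring to, in its standard form (as in the textbooks cited above):

```latex
Q^*(x,a) \;=\; r(x,a) \;+\; \gamma\, \mathbb{E}_{x' \sim P(\cdot \mid x,a)}\!\Big[\max_{a'} Q^*(x',a')\Big],
\qquad
\pi^*(x) \in \arg\max_{a} Q^*(x,a).
```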
The objective that we want to optimize in reinforcement learning, or in optimal control in MDPs, is the following: it's just the discounted sum of rewards for policy pi. We have agreed that it's enough to consider stationary policies, so we are looking for a policy pi that maximizes this quantity, defined as such. Now, if I do a little bit of gymnastics on this object, I can rewrite it in a very convenient form that is going to guide us towards algorithms. The first step is to notice that this gamma factor and this sum are not random, so I can just swap them outside of the expectation, and I only need to consider the discounted sum of expected rewards under the policy I'm considering. If I then use the definition of what an expectation is, I can rewrite this whole thing as follows: an expectation is really just a sum of rewards weighted by the probability of seeing each state-action pair, so I rewrite this expectation as a sum over the state-action space, again the reward of a particular state and action, weighted by the probability of landing in this state-action pair in round t, given that I start from nu_0. And then, as any good mathematician, when I see two sums I swap them around, because that's always what you have to do when you see two of these: I swap the sum over states and actions with the sum over time. And then I cry out that, oh, this looks like some kind of an expectation: a sum over rewards weighted by some other object, this one right here colored in red, which is going to be of some interest for us.

If I do one more tiny step, which is just to multiply and divide by one minus gamma (I do this for convenience, and you're going to understand in a second why), then I can give this object colored in red a special name: I'm going to call it the discounted occupancy measure of policy pi. And I did all of this because now I know that the discounted return objective that I'm trying to optimize, rho_gamma of pi, the discounted total reward of policy pi, can be rewritten in this very convenient linear form: my objective, up to some normalization, can be written as a dot product between this object that I call the discounted occupancy measure and the reward function. So what this little calculation shows is that the optimal control problem we're facing has some kind of hidden linearity property: the reward you're trying to optimize is linear in an appropriate representation, which is this discounted occupancy measure.

So what is the meaning of this? Oh, yes, a question: wouldn't the expectation also include the initial state distribution? Yes, right, that expectation is with respect to the initial state distribution as well. What this notation E_pi means (there's somebody calling me, it's a bit of a mystery), right, so what this notation E_pi means is an expectation over the stochastic process in which the state X_0 is drawn from the initial state distribution, the actions are taken with respect to policy pi, so each action A_t is drawn according to the policy pi in question, and the next state is drawn from P given X_t and A_t. And yes, the initial state is tricky, and the choice of the initial state distribution influences what this occupancy measure is going to look like. That's the end of the clarification, right?
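Written out, the calculation just described defines the occupancy measure and the linear form of the objective like this (my reconstruction of the formula on the slide):

```latex
\mu^{\pi}_{\nu_0}(x,a) \;=\; (1-\gamma)\sum_{t=0}^{\infty} \gamma^t\,
\mathbb{P}_\pi\big[X_t = x,\, A_t = a\big],
\qquad
\rho_\gamma(\pi) \;=\; \frac{1}{1-\gamma}\,\big\langle \mu^{\pi}_{\nu_0},\, r \big\rangle
\;=\; \frac{1}{1-\gamma}\sum_{x,a} \mu^{\pi}_{\nu_0}(x,a)\, r(x,a).
```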
So let me give you some more intuition about what this occupancy measure is, because I think it's a bit of an unusual concept. If you look at this sum the way it is written here, what it is essentially counting is the discounted number of times that I visit the state-action pair (x, a) if I follow policy pi.

Yes, a question: is this also what is referred to as the successor representation in some of the neuroscience literature? Yeah, it has a lot to do with the successor representation; let me just finish this thought and then I'll connect it to the successor representation, indeed. So, what this object is trying to measure is the discounted number of times that I visit a certain state-action pair if I follow policy pi and start from the initial state distribution nu_0. It tells me how many times I visit a certain state-action pair under this process, discounted with this gamma, so earlier visits matter more and later visits matter less and less. The reason for adding this normalization constant one minus gamma in front is to make sure that this gives me a probability distribution: it's there to normalize things so that we can think of these objects as proper distributions, and this normalization is going to make our lives a little bit easier.

So, the connection between this object and what is called the successor representation in some areas of neuroscience, and also in reinforcement learning, is that in the successor representation what you count is the number of times you visit each state given that you start from a particular state, so it's conditioned on the initial state, and what we have here is essentially an expectation of that under the initial distribution. In a successor representation, instead of this probability I would have a probability conditioned on X_0 being some specific x'. And what can be said about that, according to this little calculation, is that if I call this object mu^pi given x', then the value function at x' can also be written as a similar linear form. This is very easily seen from the same calculation I wrote here, because the only difference between the total discounted return and the value function is the conditioning on the initial state. Yeah, and I suppose this is true for the normalized value function, to be absolutely precise. Right.

All right, so this is a key object in the framework that I'm going to develop here. So really what it is... is that legit? Is this thing still live? Okay, maybe I'm going to switch over to Google Meet; I think this Zoom thing is bullshit. All right, I'm going to do something more legit. Yeah, the thing is that I just really don't know how to use Zoom. Sorry about that little awkward technical break; you can think about some clever questions in the meantime. I think Google Meet is going to have the same problem, but I think this already justifies the break in the middle. Okay, there we go. Okay, cool. Sorry about that.
All right. So really, the reason I did all of this gymnastics with the math is to rewrite the objective I have in mind in this nice linear form. I think that's as good as it gets; really an endless stream of notifications coming up. So the reason I did this is to achieve this nice linear form, and I hope to be able to exploit it using some reasonable framework.

Let's just think about an example. What I plotted here are the occupancy measures of some policies in a two-dimensional state space; these are just some large gridworlds, and what I'm plotting is the occupancy measures of six different policies. The first of these policies is sort of trying to go to the lower left; the second is just hovering around the initial state (the initial state is denoted by this little red dot there); the third is going in that direction, and so on, so all of these policies are moving into different parts of the state space. And what I'm plotting is just the discounted number of times they visit each state-action pair. So if I say that I want to maximize this reward function, which gives high rewards here in this lower-right part of the state space, then all I need to do is find the occupancy measure, or rather the policy whose occupancy measure spends the most time in that lower-right corner of the state space. So which one is that going to be? Yeah, I think it should be number six. So this is of course nice and intuitive, and you can think about it as a nice little toy to play around with, but the question is, again: how do we do this in a systematic way? How do we do it without having to enumerate all of the policies and stare at plots like this? How do we figure out which occupancy measure has the maximum inner product with our reward function?

That's a very natural question to ask, and the following observation is going to be very helpful for us. A key property of occupancy measures is that they satisfy the following so-called flow property. It essentially establishes that the state occupancy in any state (notice that if I sum out the actions, then what I get is just the discounted number of times I see each state in the trajectory) is equal to the initial probability that I start from this state, which is the first part of this sum, plus gamma times P-transpose times mu^pi. Let me just resolve this notation for you. This P-transpose mu^pi is something that turns a state-action distribution into a state distribution: it gives the distribution of states I see if I draw a state-action pair randomly from the occupancy measure mu^pi and then take one step forward according to the transition function. So this is a probability distribution over states, obtained by starting from the occupancy measure mu^pi and then taking a step forward through the transition kernel, and this gives me the next-state distribution.
So what this recurrence relationship means is that the occupancy measure itself can be written as a mixture of the initial state distribution and something that relates the occupancy measure to itself. Let me give you some more intuition, because I know this is a bit of a foreign object. Let me just mention the case where gamma is one, which means there is no discounting and we are considering the undiscounted sum of returns. In this case the formula becomes very easy: it essentially establishes that the state occupancy should be such that if I start from the occupancy measure and take one step forward through the transition function, then I arrive back at the very same distribution. So the distribution I'm considering is invariant under the transition dynamics. Or, maybe yet another way to say it: if I consider some distribution over the state-action space, not necessarily an occupancy measure, and I push it through the transition kernel, then this gives me some other distribution; and an occupancy measure is essentially a special state-action distribution that is invariant under this transformation. If I start from this distribution and take a step forward in the transition kernel, I arrive back at the very same distribution; it has this nice stationarity property.

Right, so on that question: yes, this limit of gamma equal to one is kind of weird. It is not exactly optimizing the discounted return; what it optimizes is the average infinite-horizon return. But it is sort of an easy special case to think of when considering occupancy measures.

Another question: is this P matrix the transition matrix under the policy pi, or the transition matrix of the MDP? Right, so this P is a bit of a quirk in notation; Matteo also used it in the morning. The P-transpose operator maps state-action distributions to state distributions, so it maps vectors over X times A to vectors over X; it is the transition function of the MDP, including the actions. And similarly, the P operator: if I write something like P times v, that is a state-action vector that evaluates the expectation of some function v over the next state, given (x, a). So P maps from X to X times A, and P-transpose maps from X times A to X. This is just some convenient shorthand linear algebra notation that is sometimes useful.

But I suppose the bottom line is really that the occupancy measure satisfies this very simple property. And it can actually be shown that all occupancy measures satisfy this system of equations, and the property also goes the other way around: every probability distribution that satisfies this property must be an occupancy measure as well. So mu is an occupancy measure if and only if it satisfies this system of equations.
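Written out, the flow property and the characterization just stated read as follows, with the two operators as just defined:

```latex
E^\top \mu \;=\; (1-\gamma)\,\nu_0 \;+\; \gamma\, P^\top \mu, \qquad \mu \ge 0,
```

where $(E^\top\mu)(x) = \sum_a \mu(x,a)$ and $(P^\top\mu)(x') = \sum_{x,a} P(x' \mid x,a)\,\mu(x,a)$; a non-negative mu satisfies this system if and only if it is the occupancy measure $\mu^{\pi}_{\nu_0}$ of some stationary policy pi.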
So what does that buy us? Again, this E-transpose-mu notation is something that just sums the actions out of mu; it's another simple linear operator that allows us to write the system of equations in a nice, compact form. What we know is that every occupancy measure mu satisfies this system of equations, and this is incredibly useful, because now I can turn my optimal control problem, my task of searching through the space of occupancy measures, into a simple linear optimization problem. All of the stuff I've explained establishes the fact that finding an optimal policy in a Markov decision process can be equivalently written as solving this linear optimization problem. So what am I looking for? I'm looking for a mu that maximizes this dot product, a mu under which I have maximal expected reward, and the thing I require of my mu's is that they be occupancy measures. But I noted that all occupancy measures satisfy this system of equations, these constraints, so I only need to solve this optimization problem: maximize mu times r subject to this system of linear constraints.

Sorry, can you say that again? Yes, and it is sufficient as well; that is what the theorem is saying: mu is an occupancy measure if and only if it satisfies this system. Right, that's very important. Sorry, a follow-up on that question: I assumed when you wrote this that mu is a probability measure; is this saying that the right-hand side guarantees that it's a probability measure? Yes, in fact the right-hand side guarantees that this is a probability measure, because nu_0 is guaranteed to be a probability measure. I guess there is the additional constraint that mu needs to be non-negative, which is what I'm not writing out here, but any non-negative mu that satisfies these constraints has to be an occupancy measure. So if I fix nu_0, which is a probability measure, and I take any non-negative solution of this system of equations, then I can see that I cannot rescale solutions, because nu_0 fixes the scale: on the right-hand side I have (1 minus gamma) times a probability distribution plus gamma times the probability distribution corresponding to P-transpose mu, so the total mass of mu is pinned down to one. Any further questions? I think this is rather important to understand if you want to follow what I'm about to say.

Oh yes, sorry, another question: can you clarify the intuition for why this is called a flow constraint; is it an outflow, I guess? Yeah, so this is called a flow constraint, or sometimes a Bellman flow constraint, even though Bellman did not invent it, and I think that's really recent terminology. But anyhow, it's called a flow constraint because you can think of this system of equations as describing a flow: this is the mass that sits in some state x, the left-hand side is the mass that flows out of the state through the various actions, and the right-hand side is whatever mass flows into the state, partly from the initial state distribution and partly from the previous state distribution. It's a kind of mass-preservation equation, and if you think about deterministic transitions, then this becomes exactly the usual flow constraint on a graph.
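So the linear program we keep referring to on the slide, the primal LP, is:

```latex
\max_{\mu \,\ge\, 0} \;\; \langle \mu, r \rangle
\qquad \text{subject to} \qquad
E^\top \mu \;=\; (1-\gamma)\,\nu_0 \;+\; \gamma\, P^\top \mu .
```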
In POMDPs there is another measure, called the belief state; is that something similar? Right, so once you go to partial observability, things become a lot more complicated. There, the belief state determines a sequence of probability distributions over states, which is essentially an infinite sequence, and it just gets more and more complex as you iterate. Here, by contrast, we have just one single probability distribution over states and actions, which measures the number of times I'm going to be visiting future states. The belief states satisfy similar dynamic-programming-style recursions as well, but somehow that ends up being a lot more complicated. All right, good.

So let's talk about this linear programming business. Now that you have agreed that, in order to find the optimal policy, an optimal solution to my sequential decision-making problem, I can simply look at this LP and find the occupancy measure that optimizes the expected reward, the question is: okay, how do I turn this into a policy? Because in reality I want a policy, a way of making my decisions; I don't actually care about the joint distribution of states and actions if I'm trying to produce a solution to my problem. And indeed, a policy can be extracted from the state-action distribution essentially just by conditioning on the state and looking at the conditional distribution of actions given the state; I can extract this very easily. Another very interesting fact, which I will return to in a minute, is that this primal linear program has a corresponding dual linear program as well. Those of you who know the theory of linear programming know that for every linear program there exists an equivalent dual linear program, in which there is an objective function that corresponds to the constraints of the original linear program, and a set of constraints that corresponds to the variables of the original problem. The dual linear program for Markov decision processes is stated in these terms, and if you stare at it a little bit, you may realize that it looks an awful lot like the Bellman optimality equations: here what you have is r plus gamma times Pv, which can be thought of as a Q-function, and indeed it can be shown that the solution of this dual linear program is the optimal value function of the MDP; it's V*. I'm not going to talk about this aspect all that much, but it's important for deriving some of the algorithms. The dual LP takes this form: minimize the expected initial value, subject to the constraint that the value is greater than the reward plus gamma times Pv. It's very easy to verify that V* satisfies this property, that the inequality holds in every state-action pair, and it's also easy to verify that I cannot push v any further down: I cannot make the values of this v lower and still keep the constraints satisfied. So the optimal value function is indeed a solution of this dual LP.
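For the record, the policy-extraction rule and the dual LP just described, written out:

```latex
\pi_\mu(a \mid x) \;=\; \frac{\mu(x,a)}{\sum_{a'} \mu(x,a')},
\qquad\qquad
\min_{v} \;\; (1-\gamma)\,\langle \nu_0, v \rangle
\quad \text{subject to} \quad
E\,v \;\ge\; r \;+\; \gamma\, P\,v .
```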
Question: if we use a simplex method to solve the primal LP, can we expect the policy to be sparse, since solutions of the LP lie at the extreme points of the feasible set? Right, indeed, yes. Solutions to the primal linear program are indeed sparse vectors, sparse distributions supported only on the corners of the domain, which correspond to deterministic policies. So, as we know that there is a deterministic optimal policy in the original MDP, this immediately implies that there is a sparse solution to the primal LP as well, and the non-zero entries of mu correspond exactly to the constraints that are satisfied with equality. So many of these fundamental results from MDP theory can also be understood from the perspective of linear programming.

All right, so let me give you a bit of a history lesson here. This linear programming framework was, let's say, first discovered in a special case by Manne in 1960, and also by de Ghellinck, who I think is from Belgium, and by Denardo, a French mathematician I believe, around 1970, who stumbled upon the same formulation for optimal control. They slowly developed a general theory of these linear programs, and they already showed the equivalence to the Bellman optimality equations. Then, in the operations research literature, Schweitzer and Seidmann proposed a method that really served as the foundation of most reinforcement learning algorithms developed in this framework: in particular, they proposed a certain kind of relaxation of these linear programs to reduce the effective number of states and actions, and, interestingly, they were also the first to define the squared Bellman error, which is now a key component of many approximate dynamic programming algorithms, including DQN. So I suppose you should credit them when you're talking about approximate dynamic programming. Their approach was rediscovered and popularized by de Farias and Van Roy in 2003, which really brought this idea to the reinforcement learning and ADP community, and it inspired a bunch of really nice follow-up work. The idea of all of this follow-up work is to take these linear programs, feed them into a linear programming solver, extract the solution, and then try to analyze the properties of the solution you get. This is of course a very natural approach; it's many people's first idea: well, this is an LP, right, I can just feed it into a standard solver; there are super fast, super efficient solvers for linear programs, so why not do this? Of course, one reason not to do this is that in Markov decision processes of practical interest we typically have way too many states to be able to rely on an LP solver; there are just too many variables and too many constraints to deal with, and one line of work has focused on reducing the number of constraints and variables in these LPs to make them more tractable.
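Just to make the "feed it into a solver" idea concrete, and roughly in the spirit of what you'll try in tomorrow's practical session (though the notebook there will use its own code and notation), here is a minimal sketch for a small tabular MDP; the function and variable names are my own choices, not anything from the course material.

```python
import numpy as np
from scipy.optimize import linprog

def solve_primal_lp(P, r, nu0, gamma):
    """Solve  max <mu, r>  s.t.  E^T mu = (1-gamma) nu0 + gamma P^T mu,  mu >= 0.

    P    : transition probabilities, shape (X, A, X), P[x, a, y] = P(y | x, a)
    r    : rewards, shape (X, A)
    nu0  : initial state distribution, shape (X,)
    """
    X, A = r.shape
    # Flatten mu to a vector indexed by k = x * A + a.  Row y of the equality constraint reads:
    #   sum_a mu[y, a] - gamma * sum_{x, a} P[x, a, y] * mu[x, a] = (1 - gamma) * nu0[y]
    E_T = np.repeat(np.eye(X), A, axis=1)        # (X, X*A): sums out the action
    P_T = P.reshape(X * A, X).T                  # (X, X*A): next-state operator
    res = linprog(c=-r.reshape(X * A),           # linprog minimizes, so negate the reward
                  A_eq=E_T - gamma * P_T,
                  b_eq=(1.0 - gamma) * nu0,
                  bounds=(0, None), method="highs")
    mu = res.x.reshape(X, A)
    # Extract the policy by conditioning on the state (uniform where mu puts no mass).
    mass = mu.sum(axis=1, keepdims=True)
    policy = np.where(mass > 0, mu / np.maximum(mass, 1e-12), 1.0 / A)
    return mu, policy
```

On a small gridworld like the ones in the plots, you can check that the recovered mu is sparse, exactly as discussed above.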
But of course the other problem, which is even more serious in reinforcement learning, is that we don't even know how to represent these LPs. If you look at both of them, the primal and the dual, you see that both involve expectations over the next-state distribution, expectations with respect to P. So if I don't know my transition function, I simply have no hope of writing these down exactly, or of even approximately constructing something that is digestible by a linear programming solver. For this reason, not many people have been trying to work on these approaches: since I cannot represent my LPs, I cannot feed them into an LP solver, so I need to find alternative approaches, and these alternative approaches are what I'm planning to talk about.

It was suggested to me to just go ahead in the interest of time and not take a break, because I think I'm running a little behind schedule, so I guess I'm just going to continue. That's okay, that's not too brutal. Right, so this is what we're going to be doing instead of solving the LPs directly. Actually, I have to say that tomorrow my student Germano is going to give you a practical, hands-on session about these primal-dual methods and LP-based methods, and the first approach you're going to try there is exactly this LP-based one: just take the LP, feed it into a solver, see what you get, and compare the solution with dynamic programming; you're going to see some interesting good and bad properties of the solutions you get from LPs. But the main thing I want to talk about is how to use this LP formulation as, as I said, a kind of guidance for developing reinforcement learning algorithms. How do I go beyond this idea of just solving the LPs directly, and figure out methods that can work with samples? How do I address the challenges of reinforcement learning using this as the starting point, in a similar way as reinforcement learning methods are based on the Bellman equations?

We're going to be considering essentially two settings. One is planning with a fully known transition function P and reward function r, in which case we're going to get very nice and easily implementable methods. Then I'm going to show you how to work with generative models, which means: what can I do if I have the ability to sample from the next-state distribution given any state-action pair? This is a somewhat stylized planning model in which many reinforcement learning algorithms are first developed and analyzed, and it's easy to work with, easy to understand, easy to analyze. All right. So with that, let me start talking about primal-dual methods; I guess we're going to take a break when Google Meet starts to die.

Right, so I talked about these LPs, both the primal and the dual: the primal is an optimization over the space of mu's, and the dual is an optimization over the space of value functions. But there is a concept that connects the two, which is the so-called saddle-point formulation. Maybe let me ask real quick: how many of you know about linear programming duality or Lagrangian duality? All right, okay, that's pretty good, so then I don't need to spend too much time explaining what this is.
These primal and dual LPs can be shown to be equivalent to solving this saddle-point problem, where the function you're looking at is the so-called Lagrangian of the optimization problem. This Lagrangian is a function that takes two inputs, the primal variables mu and the dual variables v, and it has the property that the saddle point of this Lagrangian, meaning the minimum with respect to v of the maximum with respect to mu, is equivalent to a solution of either the primal or the dual LP.

Now, the reason for this is the following; maybe I can write it here, let me work this technology a little bit. If I want to minimize the dual objective subject to the constraints on v, then this can be written as the minimum over v of the maximum over mu of the Lagrangian. The main point I want to make is that if I maximize the Lagrangian over non-negative mu's, then it is easy to see that what I get is a function that takes value plus infinity if v is not feasible, and takes value (1 minus gamma) times nu_0-transpose v if v is feasible. So this Lagrangian function has the special property that if v is not a feasible point, then mu can choose to go off to infinity and push the value of this function to plus infinity. As a result, if I want to minimize the maximum of this function with respect to mu, then I am truly just minimizing the objective function over the feasible set. This is the key idea of Lagrangian duality, which inspires this saddle-point formulation. The property I mentioned is easy to see if you look at the structure of this optimization problem: if the constraints on v are not satisfied, which means that this thingy over here, r plus gamma Pv minus Ev, has a positive entry somewhere, then mu can put infinite mass there, run the value of the Lagrangian off to plus infinity, and make v suffer an infinite loss.

So, essentially, what I need to do now, and the algorithmic framework this observation inspires, is the following: I'm just going to look at this Lagrangian function, a function of the primal and the dual variables, and I'm going to try to find its saddle point by treating it as a two-player game. I'm going to define two players, one that controls mu and one that controls v; mu wants to maximize the value of this function and v wants to minimize it, and we want to make sure they follow some game dynamics under which they converge to a saddle point.

Right, this E operator: I guess I should have put the definition up here. If I transpose it, it turns a state-action distribution into a state distribution by summing out the action; and the corresponding adjoint operator turns a state function into a state-action function by repeating its entries for all of the actions. So the E operator takes an X-dimensional vector as input and returns an (X times A)-dimensional vector as output, such that the entries of the input vector are repeated for all of the actions. I really should have written the constraint as holding for each action individually: what it means is just that, for every state x and action a, v(x) is at least r(x,a) plus gamma times (Pv)(x,a).
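In formulas, the Lagrangian and the property I just described are (as I read them off the slide):

```latex
\mathcal{L}(\mu, v) \;=\; (1-\gamma)\,\langle \nu_0, v \rangle \;+\; \big\langle \mu,\; r + \gamma P v - E v \big\rangle,
\qquad
\max_{\mu \ge 0}\; \mathcal{L}(\mu, v) \;=\;
\begin{cases}
(1-\gamma)\,\langle \nu_0, v \rangle & \text{if } E v \ge r + \gamma P v,\\
+\infty & \text{otherwise,}
\end{cases}
```

so that $\min_v \max_{\mu \ge 0} \mathcal{L}(\mu, v)$ recovers the dual LP, and exchanging the order recovers the primal.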
All right, okay. So let's try to make this algorithmic recipe a little bit more precise. I'm going to try to give you a little bit of a cookbook for developing primal-dual algorithms here. The idea we're going to follow is this: first, figure out what optimization method to use to find approximate saddle points of the Lagrangian; we're going to use this kind of primal-dual gradient descent idea to converge somewhere near the saddle point of the Lagrangian. Once we have run this primal-dual algorithm, we need to think about how we want to extract our policy from the resulting approximate solution. And finally, we need to understand when this policy is going to be good enough for our needs; we need to analyze the solution that we get. So: decide how to optimize, decide how to turn the solution into a policy, and then understand the suboptimality of our solution.

Question: is there some reason why the saddle-point problem is easier to solve than the linear program itself? Right, so we're going to see that for saddle-point problems we can develop algorithms that can work with samples from the transition function; these are going to be stochastic-gradient-descent-style methods for the primal and the dual players, as opposed to an LP solver, which cannot really deal with randomness or stochastic samples. That is really the main reason for us to do this; maybe I should have clarified that earlier.

All right, so let's start cooking. The idea, as I said, is to treat this problem as a two-player game and run learning dynamics for both of the agents, in a way that we hope converges to a saddle point. The precise algorithm is essentially just gradient descent for the primal and the dual players, with some extra twist that I'm going to explain in a second. We are going to initialize both sets of variables, our mu variables and our v variables, and perform a sequence of updates on both of them. Let me first talk about the dual update, because that's the easy one to comprehend: we're going to update our value functions, our v's, our dual variables, by gradient descent.
This player wants to minimize the value of the Lagrangian, so it moves in the direction of the negative gradient of the Lagrangian evaluated at the current point (mu_t, v_t), where the gradient is taken with respect to the value function; the expression is given here at the bottom, and it is easy to check. As for the primal update, we do something a little bit fancier: this is called exponentiated gradient ascent, a version of gradient ascent that is more suitable for working with probability distributions, for doing iterative optimization in the space of probability distributions. Here we are not adding the gradient to our variables, but rather putting the gradient into an exponent; this is sometimes known as multiplicative weights, or hedge, or the exponential weights algorithm. It makes sense for probability distributions because, if mu_t is a probability distribution with non-negative entries, then of course after multiplying by an exponential it is still non-negative, so I don't need to project back onto the domain; and it has many other beautiful properties as well. So this algorithm is simple enough: you essentially do primal-dual gradient descent for both players.

These kinds of primal-dual learning dynamics typically come with guarantees on what is called the duality gap. The duality gap evaluates, in a certain sense, how close we are to the saddle point of the problem. Traditionally, what is used in this definition is v-plus equal to V* and mu-plus equal to mu*; these are the so-called comparator points of the duality gap.

Yeah, hi, perhaps a naive question: why are we outputting the average occupancy measure on the previous slide, rather than the final iterate, which should be closer to the optimum? Oh yes, right. The reason for doing this is that what these methods are trying to do is minimize their regrets from their own perspective, and what we know is that the final iterates of such algorithms do not come with regret guarantees; they do not come with specific guarantees at all, and we need to do a certain kind of online-to-batch conversion. That's actually a very important point: these methods do not output their final iterate, but the average of the iterates. This is there to stabilize the solutions, and it is actually known to be required for these kinds of min-max games in the first place.

All right, so these are the solutions we output, this mu_out and v_out, and the way to evaluate them is in terms of the duality gap. Like I said, the duality gap is traditionally defined with respect to V* and mu*, and it is something that measures, in a certain sense, how far I am from the saddle point.
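To make the two updates concrete, here is a minimal tabular sketch of this primal-dual scheme with exact gradients, so the planning setting with known P and r. The step sizes, variable names, and uniform initialization are my own choices for illustration, not constants from any slide or paper.

```python
import numpy as np

def primal_dual_planning(P, r, nu0, gamma, T=5000, alpha=0.05, beta=0.05):
    """Exponentiated gradient ascent on mu, gradient descent on v, for the Lagrangian
    L(mu, v) = (1-gamma)<nu0, v> + <mu, r + gamma*P v - E v>.
    P: (X, A, X) transition probabilities, r: (X, A) rewards, nu0: (X,) initial distribution."""
    X, A = r.shape
    mu = np.full((X, A), 1.0 / (X * A))     # primal iterate: a state-action distribution
    v = np.zeros(X)                         # dual iterate: a value-function-like vector
    mu_sum, v_sum = np.zeros((X, A)), np.zeros(X)
    for _ in range(T):
        # grad_v L = (1-gamma) nu0 + gamma P^T mu - E^T mu   (vector over states)
        grad_v = (1 - gamma) * nu0 + gamma * np.einsum('xay,xa->y', P, mu) - mu.sum(axis=1)
        # grad_mu L = r + gamma P v - E v                     (vector over state-action pairs)
        grad_mu = r + gamma * P @ v - v[:, None]
        v = v - beta * grad_v               # dual player: plain gradient descent
        mu = mu * np.exp(alpha * grad_mu)   # primal player: multiplicative-weights step
        mu /= mu.sum()                      # renormalize to stay on the simplex
        mu_sum += mu
        v_sum += v
    mu_out, v_out = mu_sum / T, v_sum / T   # online-to-batch: output the averaged iterates
    mass = mu_out.sum(axis=1, keepdims=True)
    policy = np.where(mass > 0, mu_out / np.maximum(mass, 1e-12), 1.0 / A)
    return policy, mu_out, v_out
```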
It's easy to see that if the output v is V* and the output mu is mu*, then the duality gap, in its traditional definition, is zero; so in this sense it measures how far I am from the saddle point. Now, it turns out that if you want to apply these kinds of methods to Markov decision processes, you need to introduce a little more flexibility into the definition of the duality gap: I need to define the duality gap with respect to some comparator points. So I'm not trying to measure a distance from the saddle point itself, but a kind of distance from some cleverly chosen comparator points. And it turns out that this duality gap, this quantity against comparators mu-plus and v-plus, can be written in a very convenient form. I'm not going to do the entire derivation; I'm just going to show you that the duality gap can essentially be rewritten as the sum of the regret of the max player and the regret of the min player. So the duality gap is a quantity that, after some minor mathematical gymnastics, can be rewritten as the optimization error of the first player plus the optimization error of the second player, their regrets against the comparators. What regret means in this case is, for example, that the regret term on the right measures the extent to which the v-player, the player picking the v_t's, regrets not having played v-plus during the entire optimization process: how much additional loss did it suffer by not having known v-plus since the beginning of the decision-making process? And gradient descent and exponentiated gradient descent can guarantee that these quantities grow slowly; in fact, when divided by T, they go to zero at appropriate rates. So this is the type of guarantee these methods give us. That takes care of the optimization and the kind of guarantees we have on the optimization algorithms.

Now, the next question is how to extract the policy from my solution, and the natural idea is to simply take the output occupancy measure mu_out and compute the conditional distribution of actions in each state, thus converting the occupancy measure, or approximate occupancy measure, into a policy. This works as long as the denominator is positive for each state x; otherwise it's somewhat tricky to extract the solution, and that is one downside of this whole linear programming framework in the first place. And if I do this, then I can also show that the policy I output is some mixture of all the policies computed over time, with coefficients that are very difficult to calculate; it's given by some complicated calculation, so this is not a very practical way of representing the policy.

So let me show you how to put all of these pieces together and how to analyze the solution. I think this is one of the most satisfying results in the area of these primal-dual methods.
The problem you face when you try to analyze this method, and many people have faced it (among the papers I'm listing here at the top there is a very nice paper by Mengdi Wang, some follow-up work with her student, work by Ching-An Cheng, and with my student Joan Bas-Serrano we've also been working with such methods), is that the quality of the final policy that this procedure outputs is hard to connect with the duality gap. What I have is a guarantee on the duality gap, which is directly controlled by the sum of the regrets; the big question is how to translate this into a guarantee on the quantity I'm actually interested in, which is how suboptimal my policy is: how far am I from the optimal policy? The duality gap by itself doesn't tell me anything about this, or at least it didn't for a long time: people kept trying to make this connection, coming up with more and more complicated methods, until this really brilliant paper by Ching-An Cheng and coauthors came out, an AISTATS 2020 paper, whose idea was the following. It's really mind-blowing; I hope you're going to find it satisfying, because I'm going to show you the proof.

What they showed is that if I choose the comparator for mu as mu*, and the comparator for v as V of pi_out (notice that this comparator point is not V*, most importantly; this is not the traditional comparator chosen for the usual purposes of min-max optimization; I pick v-plus, the comparator for the v-player, to be V of pi_out), then I can show an exact relationship between the duality gap and the quantity I care about: the duality gap is exactly the gap between the expected reward of the optimal policy and the expected reward of my policy. So there is a direct relationship between these objects. In particular, what they show is that the first term is exactly equal to mu* times r and the second term is equal to mu of pi_out times r, and the second half is the smart part. Let me show you the proof, because I really like it.

The first part is relatively easy: just consider the Lagrangian evaluated at (mu*, v_out). By plugging in the definition I get this expression, and if I reorder the terms a little (I keep mu* times r first, then I move this gamma P multiplier from one side of the dot product to the other, and do the same with E), then I arrive at an expression where everything else is multiplied by v_out. And then I look at this formula, and what do I notice? I know that mu* is an occupancy measure, so it satisfies the constraints of the linear program, and as a result this whole second term is just zero; it goes away by the feasibility of mu*. So that was the easy bit.
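Written out, this easy half of the argument is just:

```latex
\mathcal{L}(\mu^*, v_{\mathrm{out}})
\;=\; \langle \mu^*, r\rangle
\;+\; \big\langle (1-\gamma)\,\nu_0 + \gamma P^\top \mu^* - E^\top \mu^*,\; v_{\mathrm{out}}\big\rangle
\;=\; \langle \mu^*, r\rangle,
```

since mu* satisfies the flow constraints, so the second inner product vanishes.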
The other part, which required a touch of genius, is to evaluate the Lagrangian at (mu_out, V of pi_out); and V of pi_out is not v_out, it's the value function of the policy that I output at the end of the day. If I plug in the definition again, I see that I get mu_out times r plus gamma P V of pi_out minus E V of pi_out, plus the initial-value term. If I now make the additional observation that mu_out times E V of pi_out is equal to mu_out times Q of pi_out (this can be shown in about one line of calculation that I'm not going to write out, and it uses the fact that pi_out is exactly the policy obtained by conditioning mu_out), then I can replace this E V of pi_out by Q of pi_out. And then what do we notice? Well, we notice that this entire term is zero, because of the Bellman equations, which say that Q of pi_out equals r plus gamma P V of pi_out; these are the Bellman equations of the discounted Markov decision process for the policy pi_out. So this whole term is just the left-hand side of the Bellman equations minus the right-hand side, which is zero. Okay, I guess enough of this red ink. So what can be shown is that this entire first term is zero, because of the Bellman equations and my clever choice of comparator, and the second term is really just the expected reward of my output policy: it's the value function of the output policy evaluated at the initial state distribution, which is exactly what I want to maximize. So this is what concludes the proof: the duality gap, evaluated at this very curiously chosen comparator point, exactly equals the suboptimality gap of the policy I get at the end of the day.

Right, so let me show you what sort of guarantees one can derive from this. If we run iterative algorithms for both the primal and the dual players, for example the scheme I showed you before, exponential weights for the max player and gradient descent for the min player, and these methods have regret bounded by R^mu and R^v respectively, then what I can show, using the previous calculation, is that the suboptimality of the policy I output is simply upper bounded by the regret of the mu-player plus the regret of the v-player, divided by T. This just puts the two previous results together, and at this point it really follows from that one-line calculation. This has been done by Cheng et al., and there is also another very nice paper by Jin and Sidford.

So if you apply specific regret minimization algorithms in this framework, one guarantee you can derive is the following. If you know the exact transition function P and the exact reward function r, it means that you can run this algorithm exactly. Sorry, I need to go back quite a bit: here, in this primal-dual mirror descent method that I've shown, I said that both players are going to be using the exact gradients, which require evaluating the transition function and the reward function without errors. If I'm able to do this, it means that I'm in the setting of planning in a Markov decision process, and then I can implement this algorithm exactly, without errors and without noise in the gradients. If I do that, then I know that both players are going to have regret bounded by roughly the square root of T, and I'm going to obtain an epsilon-optimal policy after a given number of iterations; in particular, after about one over epsilon squared iterations, I will find an epsilon-optimal policy.
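To collect what this gives us, in the normalized form where returns are written as inner products with occupancy measures:

```latex
\underbrace{\mathcal{L}(\mu^*, v_{\mathrm{out}}) - \mathcal{L}(\mu_{\mathrm{out}}, V^{\pi_{\mathrm{out}}})}_{\text{duality gap at } (\mu^*,\, V^{\pi_{\mathrm{out}}})}
\;=\; \langle \mu^*, r\rangle - \langle \mu^{\pi_{\mathrm{out}}}, r\rangle
\;\le\; \frac{R^{\mu}(T) + R^{v}(T)}{T},
```

so the suboptimality of the output policy is controlled directly by the two players' average regrets.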
Why is that? Because the regrets, or I guess, sorry, the regrets divided by T, go to zero at a rate of one over the square root of T for both players. As a result, there is going to be a time at which the errors drop below epsilon: this T_epsilon is the first time the average regrets go under epsilon, and if you work through the math, you see that after on the order of X squared times A over epsilon squared iterations you will find an epsilon-optimal policy with this scheme.

Now, of course, you can tell me that this result is not particularly great; it's not a real competitor for dynamic programming, because in this scenario, where I know P and I know r, I might as well just do dynamic programming, I might as well just do value iteration, and converge linearly to an optimal solution. So by itself this is not such a great and attractive result. But this method really starts to shine if you allow it to use stochastic gradients, if you start considering the problem of planning with a generative model, because in that case you can build stochastic estimators of the gradients; and this is something you're going to be doing in the live session tomorrow. I guess some of this notation on the slide is a little bit scrambled, but I think it's understandable. So this is the gradient with respect to mu: if I take the gradient with respect to mu, which takes this form, I can find an unbiased stochastic estimator of it by simply replacing this P, the transition function, with sampled transitions. In particular, what I can do is that, for every x and a, I generate one sample of the next state, and I set P-hat_t of x' given (x, a) to be simply the indicator that the sample I drew was x'. This is an unbiased estimator of the transition function at this state-action-next-state triple, and I can plug it into the formula to get an unbiased estimator of the gradient. So this is the power of this method: it is very easily adapted to stochastic updates and stochastic gradients, and similar tricks can be used to come up with stochastic estimators of the gradient for the v-player as well.

And what you can show, perhaps surprisingly, is that this algorithm, primal-dual gradient descent with these estimators, in this stochastic setting, with appropriate modifications to the notation, satisfies the exact same kind of guarantee as the one I got for the full-information version of the problem. In particular, I'm going to find an epsilon-optimal policy after roughly one over epsilon squared samples, and the number of necessary samples also scales with the number of states and actions in a relatively reasonable way. What is notable about this is that, having developed all of this theory ahead of time, proving the result is basically immediate: we just put together the regret bounds for the primal and the dual players, and we get a guarantee that is close to being optimal for this specific setting.
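As a sketch of the generative-model version, the only change relative to the planning loop above is how the mu-player's gradient is formed; here is one way to write the estimator just described, with names of my own choosing (the live session will have its own version):

```python
import numpy as np

def sampled_grad_mu(P, r, v, gamma, rng):
    """Unbiased estimate of grad_mu L = r + gamma*P v - E v, using one sampled next state
    per (x, a) pair.  The generative model is simulated here by sampling from P directly."""
    X, A = r.shape
    Pv_hat = np.empty((X, A))
    for x in range(X):
        for a in range(A):
            x_next = rng.choice(X, p=P[x, a])   # one call to the generative model at (x, a)
            Pv_hat[x, a] = v[x_next]            # indicator estimate of P, applied to v
    return r + gamma * Pv_hat - v[:, None]

# Usage inside the primal-dual loop: replace the exact grad_mu with
#   grad_mu = sampled_grad_mu(P, r, v, gamma, rng)   where rng = np.random.default_rng()
```

A similar one-sample trick gives an unbiased estimate of the v-player's gradient, which is what the guarantee just stated refers to.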
So in the remaining little time that I have, let me just say a few words about how to extend all of this to linear function approximation. I'm not going to have really any time for this, so maybe I'm just going to give a shout-out to my student who's been working a lot on this. She's falling asleep right now — but no, she's seen this so many times. So that's Nneka, and this is largely based on her work.

So the problem with these, well, cursed LPs is that they are really just way too high-dimensional: they have way too many primal variables, way too many dual variables, and way too many constraints for this method to be applicable to large-scale reinforcement learning scenarios. So it's unclear at the moment — or well, it used to be a big question — how to introduce function approximation in this scenario: how to introduce Q functions and how to use linear parametrizations for these Q functions in a reasonable way. So after quite a bit of research — and some of this was done by my postdoc here, Ciara Pike-Burke, and by Joan Bas, my student — we have managed to come up with a sort of relaxed version of the LPs that has a Q function in there, in particular in the dual. If you look at the primal LP, this really corresponds to some kind of constraint splitting and projection in the primal. What's important about this is that this object here can be thought of as a linearly parametrized Q function, and this allows the use of linear function approximation in the linear programming scenario. And it can be shown that the optimal solutions of these LPs are going to be the true optimal policies, under some standard but restrictive conditions on the Markov decision process.
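To give a rough sense of what this relaxation looks like, here is one way to write the resulting Lagrangian; the precise formulation used in this line of work differs in various details (normalizations, constraint sets), so this display is an illustrative reconstruction rather than the slide's LP.

```latex
% Constraint splitting introduces a second occupancy-like variable d, and the
% coupling \mu = d is relaxed to hold only through a feature map \Phi,
% i.e. \Phi^\top \mu = \Phi^\top d.  With multipliers v (flow constraint) and
% \theta (feature constraint), the object \Phi\theta plays the role of a
% linearly parametrized Q function:
\begin{align*}
  \mathcal{L}(\mu, d;\, v, \theta)
    = (1-\gamma)\langle\nu_0, v\rangle
      + \langle\mu,\; r + \gamma P v - \Phi\theta\rangle
      + \langle d,\; \Phi\theta - E v\rangle .
\end{align*}
```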
And what's most interesting from the perspective of this talk is that we can turn this into a similar saddle-point problem as before, by introducing the Lagrangian — by folding the primal and the dual LPs into one — and finding the min-max (saddle) points of this little optimization problem. So here the solution that one can think of is just, again, to use primal-dual gradient descent on all of these new variables that we have introduced: we have introduced some variables d, for which we're going to be doing some kind of gradient descent updates, we have introduced these parameters theta, for which we're going to be doing gradient descent, and so on. And then we conclude all of this calculation, which can be quite involved, by outputting the average of the d variables that we calculate. I know that we're going rather fast here and I don't really have too much time to explain the details, but it can be shown that, again, we can extract the policy from this solution d_out using the same procedure as before, by just calculating the conditional distribution of the actions given the states. And then it can be shown that this trick by Cheng et al. continues to work, and we do get some kind of optimality guarantee for the resulting policy.

But the problem is that this variable d_out is still a very high-dimensional variable, and this does not result in an algorithm that you can run in reality. The problem is that the policy we get from all of this is going to be some kind of weird weighted mixture of the policies that you calculate, and these are just somehow not practical at all at the end of the day. So at this moment you may feel that, you know, all of these primal-dual methods are not something that gives you practical algorithms. But it turns out — and this is a problem that we've been working on a lot with Nneka — that you can adapt all of this primal-dual framework. Again, there are many, many peculiarities of these algorithms that I don't have time to get into, but the most important thing that I want to highlight here is that the policies used by this method are all softmax policies. And the policy that is returned by this primal-dual procedure — which is some kind of complicated version of the same method, with some bells and whistles — is simply one of these policies that I have just calculated, and this is something that comes from another online-to-batch conversion. So importantly, the output of this procedure has a simple and nice softmax representation. Running the algorithm itself can be rather complicated — it may require full loops over a very, very large part of the state-action space — but once I've run my algorithm, I can return a very simple policy, without having to rely on converting an occupancy measure into a policy. So somehow this method allows us to not really work in the space of occupancy measures, and to work more directly in the space of policies. So I'm going to skip all the beautiful technical parts, and maybe I'm just going to flash this result.
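Since both policy-extraction steps are only described in words, here is a small Python sketch of the two: converting an occupancy-measure-like variable d into a policy by conditioning, and the softmax policy induced by a linearly parametrized Q function. The array shapes, the temperature parameter, and the handling of zero-mass states are all illustrative assumptions.

```python
import numpy as np

def policy_from_occupancy(d, eps=1e-12):
    """Turn an occupancy-measure-like table d (shape (n_states, n_actions))
    into a policy: pi(a|x) is the conditional distribution of actions given
    states under d.  States with zero mass get a uniform policy, which is an
    arbitrary choice made here for completeness."""
    mass = d.sum(axis=1, keepdims=True)
    return np.where(mass > eps, d / np.maximum(mass, eps), 1.0 / d.shape[1])

def softmax_policy(theta, phi, alpha=1.0):
    """Softmax policy induced by a linearly parametrized Q function:
    pi(a|x) proportional to exp(alpha * phi(x, a)^T theta).

    phi   : array of shape (n_states, n_actions, dim), feature tensor.
    theta : array of shape (dim,), the learned parameters.
    alpha : temperature; purely illustrative here."""
    q = phi @ theta                        # Q_theta(x, a), shape (n_states, n_actions)
    z = q - q.max(axis=1, keepdims=True)   # stabilize the exponentials
    p = np.exp(alpha * z)
    return p / p.sum(axis=1, keepdims=True)
```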
So this gives us a way of producing an epsilon-optimal policy with a polynomial number of samples in infinite-horizon MDPs under linear function approximation, which is still a problem that is very difficult for other theoretically grounded methods that are computationally efficient. So this is, I think, a very interesting result. I'm going to skip over all of this to try to finish on time.

So let me just say a few words about future directions. All of this stuff that I told you about was limited to the setting of planning, either with full knowledge of the transition function and the rewards, or with generative access to the same objects, and I would say that this is true of essentially all currently available primal-dual methods for reinforcement learning: at the moment these methods are limited to this kind of relatively restricted setting. There has been some work on trying to derive reinforcement learning methods from this framework. For example, we have a follow-up paper to our work with Nneka — done with my student Germano, with Matteo, and with Nneka — where we applied essentially the same recipe to offline reinforcement learning, which gives a somewhat more satisfying algorithm; you can find that on arXiv. But really the big question is: how do we derive practical — really, actually implementable and practically applicable — reinforcement learning methods from this framework? Right now there's no final answer to this, but I will say that there are several methods derived from this linear programming framework that I did not mention today. Gary has already mentioned relative entropy policy search by Jan Peters et al., and I think Matthew has already also mentioned this: we had a follow-up on it, called logistic Q-learning, with my student Joan Bas, and there's also another similar method called convex Q-learning by Sean Meyn and co-authors. So there has been some progress in this direction, extending LP-based and primal-dual-style methods to the reinforcement learning scenario, and I would say that this is probably a very good moment to start thinking about these problems. It's a very promising area with a lot of open problems that await to be solved, and hopefully all of these methods can eventually become a little bit more practical and a little bit more accessible. I hope that you found this introduction useful and that you're going to be able to work on some of these problems in the future. Right, so now let's get us some reward.

What were the probabilistic qualifiers for those sample complexity results on the stochastic versions of these?

So, yeah, these bounds hold in expectation — in expectation, if that's what you mean. Somehow this method is inherently random, and it looks very hard to provide high-probability guarantees here, because of this random policy that we output. For such randomized output policies, it's hard to provide high-probability guarantees, so that's a nice open question. We thought about a million things to avoid doing this, and this is the best thing that we could come up with — but we are very happy with it, because we didn't need to do the occupancy-measure stuff. Okay, so maybe one more.

The magical theorem you showed had an equality, right? And if we take the regret bounds to be lower bounds, is there work showing results on the limitations of this approach? Right.
Yeah, so regret lower bounds are typically based on specially constructed counterexamples, and those counterexamples cannot necessarily be embedded in this reinforcement learning scenario. The way to prove regret lower bounds — the way to show that such-and-such learning problem is harder than another — is typically to construct a very peculiar hard case, and that hard case is not necessarily embeddable into this setting. So those hard cases are typically unstructured, and...

So they're mainly examples that we're not interested in — is that what you're saying? These handcrafted examples are examples that we're not interested in, is that what you're saying? I'm wondering if there's anything that would rule out hope in any special case, or...

Right, yeah. That actually leads to several good points, including lower bounds on the best achievable complexity by such min-max based methods. I think that's a good thing to look into, but I would find it really hard to think of lower bounds in this scenario. Let's do that at coffee. All right. Thank you very much. Okay