Okay, fine. So this is work that I have not spoken about at IIT Bombay before, I think. I was at the Indian Institute of Science for the last six months before I moved here, and this is work that started then. My co-authors are Neeldhara Misra, who is an INSPIRE faculty fellow in the CSA department at IISc, and Aditya Gopalan, who is an assistant professor in ECE there. This is recent work; it hasn't been published yet, and as you will notice, there are a lot of rough edges and a lot of room for the ideas to be refined. I do hope this will be a session where the audience can actually think about some of these problems and perhaps help me sort out some of my own thinking, okay?

The problem is MDP planning. I will begin by describing that problem in some detail, so I'll spend some time formulating it precisely. I don't mind spending about half the talk on existing ways of dealing with this problem, because that really sets things up for us. There are at least three clearly demarcated approaches for finding an optimal policy in an MDP: linear programming, value iteration, and policy iteration. Having gone through those, it will be easy to see how our contribution fits into this landscape. In particular, we use an algorithm we call Planning by Guessing and Policy Improvement. I'll present the algorithm along with its analysis. And I do hope to have five or ten minutes at the end for the relations between this approach and linear programming. That is a topic I don't know very well, so if there are people in the audience who know linear programming, simplex, pivoting and all that, I think you'll be able to appreciate some of the connections I'm going to outline, and I hope you can help as well. You can ask me questions at any time; just make yourself visible.

Okay, MDP planning. What is an MDP? An MDP is a Markov Decision Problem. It's a standard abstraction for sequential decision making: if there's an agent in an environment that must take actions that maximize its long-term gain, how must it act? This question is mathematically formulated using a Markov Decision Problem. An MDP comprises a set of states S, a set of actions A available from each state, a reward function that gives a real-valued reward for every action taken from every state (take an action from a state and you get a number as a reward, specified by the reward function), and a transition probability matrix that governs how transitions to the next state happen. Every time you take an action from a state, the transition matrix specifies a probability distribution over next states, and you are stochastically transported to one of those states based on the probabilities in this matrix. So: S, A, R, and P.

What an agent is interested in is behaving optimally, and an agent's behaviour is encoded by what's called a policy. A policy is a mapping from states to actions. By this definition it is a policy that is Markovian, deterministic, and stationary; there are various generalisations, but in essence a policy is a mapping from states to actions, which means from every state the policy tells the agent which action it must take. And you can imagine that associated with every policy there is a long-term return that the agent will gain by following that policy.
You can also associate the long-term return with every state. Precisely, V^pi(s), the value of policy pi starting from state s, is in expectation the reward at the first time step, plus gamma times the reward at the second time step, plus gamma squared times the reward at the third time step, and so on all the way to infinity. We discount future rewards by a discount factor gamma, a number between zero and one, in order to keep this sum bounded. This is the definition of expected long-term reward under the infinite-horizon discounted model: you sum your rewards to infinity and discount them with the discount factor. There are alternative definitions; there's something called average reward, and something called total reward. If you assume that your MDP always takes you to terminal states where the episode ends, then you don't need to discount; even without discounting you can guarantee that this number stays bounded.

So we've said that an MDP has states, actions, rewards, and transitions, and we've said what a policy is. A policy is not a property of the MDP; it's a separate object. You can, however, think of the discount factor as part of the MDP. So an MDP, in essence, is specified by a set of states, a set of actions, a reward function, a transition function, and a specification of how much the future matters. If gamma is small, the expected long-term return does not depend that much on the distant future; the larger you make gamma, the closer to one, the more your long-term rewards matter.

The planning problem, precisely, is: given an MDP — given S, A, R, P, and gamma — find a policy pi star, from the set of all policies, such that from every state in the MDP, following pi star gives an expected long-term return at least as good as what you would get by following any other policy pi in the MDP. Let me say that in English. You are in an MDP; following any policy, you get some expected long-term return from every state; the expectation is there because there is stochasticity in your transitions. A policy pi star, the way we have defined it, is one such that following pi star is necessarily at least as good as following any other policy, in terms of the expected long-term reward, from every state. The planning problem is: given an MDP, can we find such a policy pi star? The problem is well defined: given an MDP, such a policy pi star indeed exists. The question is how difficult or easy it is to find one. Questions? Yes?

Question: such a policy is really a pasting together of the best policies for every starting state, right? A policy pi is a map from S to A. Is the "for all s" in the definition important? It might be that you never reach certain states. Answer: V^pi(s) is the return I get if I use policy pi and start at s. Question: got it. But then if some other policy pi prime is better starting from s prime, I can stitch together a new policy that behaves like pi when started at s and like pi prime when started at s prime. That is also a policy, right? Answer: yes, and that stitched policy presumably dominates both the others. So in that sense this is really a state-by-state question — starting at s, what is the best policy? — and the two views are not fundamentally different. Because if you have lots of such local policies, I'm just agglomerating all of them, pasting them together and calling the result one policy, because I've defined it this way. Good.
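To make these objects concrete, here is a minimal sketch in Python of a finite MDP held as plain arrays, together with a brute-force Monte Carlo estimate of V^pi(s). All names, shapes, and toy sizes here are my own illustrative choices, not anything from the talk's slides.

```python
# A finite MDP as plain arrays, plus a Monte Carlo estimate of V^pi(s).
import numpy as np

rng = np.random.default_rng(0)

n, k, gamma = 4, 2, 0.9             # number of states, actions, discount factor
R = rng.standard_normal((n, k))     # R[s, a]: immediate reward
P = rng.random((n, k, n))
P /= P.sum(axis=2, keepdims=True)   # P[s, a, :]: distribution over next states

pi = rng.integers(k, size=n)        # a deterministic policy: state -> action

def monte_carlo_value(s, pi, horizon=500, episodes=200):
    """Estimate V^pi(s) = E[ r_1 + gamma*r_2 + gamma^2*r_3 + ... ] by simulation."""
    total = 0.0
    for _ in range(episodes):
        state, ret, disc = s, 0.0, 1.0
        for _ in range(horizon):    # gamma^horizon is negligible by then
            a = pi[state]
            ret += disc * R[state, a]
            disc *= gamma
            state = rng.choice(n, p=P[state, a])
        total += ret
    return total / episodes
```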
Okay, fine. For the purposes of this talk, please bear in mind that we are in an MDP which has a finite set of states and actions. In particular, let n be the number of states and k the number of actions per state. We're going to derive bounds in terms of n and k, so let's keep that with us as we go along. Okay, how might we find an optimal policy in an MDP? We'll look at three different ways, and the third family, policy iteration, is the one closest to our own contribution. I'll spend more time on policy iteration than on the other two, for good reason: it motivates what we end up doing and gives us a baseline against which to compare. Then I'll go into some detail describing our algorithm.

Fine. Now, it can be shown that the value function corresponding to the optimal policy pi star — the value function is just a way of representing the values you will get at every state; it is a vector with n entries, one per state, giving the value of that state under the policy — in particular, if you take the optimal policy, its value function V^{pi star} is usually denoted V star. It can be shown that V star is the unique solution of what are called Bellman's optimality equations: V star of s equals the max over actions a of R(s, a) plus gamma times the sum over next states s prime of P(s, a, s prime) times V star of s prime. This ought to look like the expected long-term return to you: R(s, a) is the reward you get by taking action a from state s, P(s, a, s prime) is the probability with which you then go to state s prime, and the subsequent long-term reward you will get is encapsulated in V of s prime. By putting a max operator there, we are saying that under an optimal policy you always take the actions that maximize this sum: immediate reward plus gamma times the reward from the next state onwards. So this is basic textbook material: the optimal value function is the unique solution of Bellman's optimality equations.

Now, it turns out that Bellman's optimality equations are not particularly straightforward to solve; they are not linear equations. How many equations do we have like this? Correct, one for every state, and there are n states, so there are n equations. If you take the state values to be the unknowns, you have n equations and n unknowns, but they are not linear equations. It turns out, though, that you can derive a linear program based on Bellman's optimality equations. You optimise this quantity — the sum of these n numbers — subject to these constraints, of which there are n times k, one for every state-action pair. Solving this linear program amounts to finding V star: the solution of this linear program is unique, and it is indeed the solution of Bellman's optimality equations. So what have we done? It's textbook material: if you want an optimal policy, you can at least get its value function by solving Bellman's optimality equations, and you can solve Bellman's optimality equations by solving this linear program. This is a linear program with n variables and n times k constraints. There's an alternative formulation too: you can instead make a linear program with nk variables and n constraints.
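The talk doesn't pin down the exact LP on the slide, but the standard primal formulation with n variables and nk constraints is: minimise the sum of the V(s), subject to V(s) >= R(s,a) + gamma * sum over s' of P(s,a,s') V(s') for every state-action pair. A rough sketch, reusing the toy arrays from the previous snippet; SciPy's linprog is just one convenient solver choice of mine.

```python
# LP for V*: minimise sum_s V(s)  s.t.  V(s) >= R(s,a) + gamma * P(s,a,.) . V
from scipy.optimize import linprog

c = np.ones(n)                      # objective: minimise the sum of state values
A_ub, b_ub = [], []
for s in range(n):
    for a in range(k):
        row = gamma * P[s, a]
        row[s] -= 1.0               # gamma * P[s,a,:] . V  -  V(s)  <=  -R(s,a)
        A_ub.append(row)
        b_ub.append(-R[s, a])

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n)
V_star = res.x
```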
But at any rate, what's the complexity of solving a linear program? I hope somebody will spell it out anyway. One way to do it is an interior point method. You can get it done in weakly polynomial time: polynomial in the number of constraints and variables, which here means in n and k, but with an additional dependence on the precision to which you want your answer. Equivalently, if you represent your MDP with a certain number of bits B, that dependence comes into your bound. We do not know of a strongly polynomial algorithm for linear programming. Fine, so that's linear programming; it's one way of doing it.

Value iteration is a very contrasting style of finding the optimal policy. By the way, the LP gives you the value function of the optimal policy; it's very easy to derive the optimal policy from the optimal value function, so just take that from me. More standard approaches for MDP planning use dynamic programming, meaning they solve small problems and then use those solutions to solve bigger ones, and value iteration is one such approach. Value iteration progresses as follows. You start with an arbitrary initialization of your value function: V_0 is an arbitrary n-length vector. Then you repeatedly apply the following iteration: V_{t+1} is an update that depends on the reward function and the transition function, applied to V_t. Essentially, we have a current estimate of our value function, and we improve it by applying this operator, which is called the Bellman optimality operator. It is possible to show that if you keep doing this you will converge, and you will converge to V star.

The number of steps this takes is what is going to interest us. Each of these updates is itself fairly simple: maybe n squared k operations. You have n entries in your V, each depends on the n entries of the previous V, and you take a max over the k actions, so that's n squared k. It's polynomial — strongly polynomial — in n and k. So what really governs the complexity of this algorithm is the number of iterations to converge, and indeed it can take as many as polynomial in n, k, B (the size of the representation), and 1/(1 - gamma). In essence, 1/(1 - gamma) is the effective length of our horizon: if gamma is the discount factor, then rewards from states we might reach up to roughly 1/(1 - gamma) steps ahead influence our current value function; beyond that point, up to machine precision, it does not matter, but up to that point it does. So you really cannot bypass this dependence on 1/(1 - gamma), which comes into the bound.

So linear programming gave us poly(n, k, B). Here we have an additional 1/(1 - gamma), which is not always bad in practice, because people find that linear programming — using simplex, or using your best solver, CPLEX — often does not do nearly as well as approaches such as value iteration. Value iteration will do much better if, for example, you can initialize with a good value function, or you have some way of eliminating actions; there are lots of tricks you can do. But in a formal theoretical sense, you can construct MDPs in which this dependence on 1/(1 - gamma) shows up, and it can become arbitrarily bad as gamma gets closer and closer to one.
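As a sketch, the Bellman optimality operator and the value-iteration loop look like this, again reusing the toy arrays above; the stopping tolerance is an arbitrary numerical choice of mine.

```python
def value_iteration(R, P, gamma, tol=1e-10):
    """Repeatedly apply the Bellman optimality operator until (numerical) convergence."""
    n, k = R.shape
    V = np.zeros(n)                  # arbitrary initialisation V_0
    while True:
        Q = R + gamma * (P @ V)      # Q[s, a]; P @ V has shape (n, k)
        V_new = Q.max(axis=1)        # max over the k actions
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```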
Yes? So we looked at linear programming and we looked at value iteration. Next we will look at policy iteration. But before that, it's very useful for us to know a subroutine that policy iteration uses — a very common subroutine one needs when doing MDPs, reinforcement learning, and planning. The nonlinear equations I showed you in the context of linear programming were Bellman's optimality equations. Here we will define something called Bellman's equations. Bellman's equations are written down when you are given a policy. Bellman's optimality equations take a max over all the actions, but say you're given a policy pi; remember we defined V^pi(s) as the expected long-term discounted reward. You can write the same thing recursively: instead of writing V^pi(s) as an expectation over the whole trajectory, you write it in terms of V^pi itself. Essentially, gamma times the rewards from the second step onwards gets replaced by gamma times V^pi of the state we land in next; that's what happens when you fold it in recursively. This set of equations, which depends on pi — this is V^pi(s), not V star of s, and on the RHS we again have V^pi — is called Bellman's equations, and V^pi is called the value function of pi.

Just like V^pi, we also define something called Q^pi, the action value function of pi. V^pi(s) is the expected long-term return of starting from state s and following pi. The slight difference we need for Q^pi is that Q^pi(s, a) is the expected long-term reward the agent gets by taking action a from state s — just that one time — and then subsequently following pi. There is just this one-step difference between Q^pi and V^pi. And by that definition, V^pi(s) is exactly Q^pi(s, pi(s)): if a is the action the policy specifies, the two are exactly the same. It's useful for us to know this too; these are both definitions.

Now, I asked this question about Bellman's optimality equations; let me ask it again here. Given a policy pi, how many equations do we have like this? Again, one for each state on the LHS. How many unknowns? The same. And these are linear equations, so you can solve them with Gaussian elimination, or your favourite solver, in time that is, let's say, order n cubed — or n to the 2.37, whatever. Yes? So given a policy, what we have just seen is that you can find the value of that policy — this is called policy evaluation — by solving this set of linear equations, Bellman's equations. You're given a policy; you want its value function, or equivalently its action value function; you can do this in n cubed time, for all practical purposes. Yes?

Question: but how is this different from Bellman's optimality equations, which is what we were trying to solve there? Correct: Bellman's optimality equations have no fixed policy that they work with; they work with the optimal policy in an abstract way. And finding that, naturally, is hard — which is why you need the linear program, and it takes a long time. Given a policy, on the other hand, in order n cubed plus n squared k time, or something like that, you can find out what its value function is.
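Policy evaluation, as described, is just solving the linear system (I - gamma * P_pi) V = r_pi. A sketch, with Q^pi then read off from V^pi; the function name is mine.

```python
def policy_evaluation(pi, R, P, gamma):
    """Solve Bellman's equations (I - gamma * P_pi) V = r_pi for a fixed policy pi."""
    n = R.shape[0]
    P_pi = P[np.arange(n), pi]       # (n, n): transition matrix under pi
    r_pi = R[np.arange(n), pi]       # (n,):  one-step reward under pi
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    Q = R + gamma * (P @ V)          # action value function Q^pi(s, a)
    return V, Q
```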
This policy evaluation is the subroutine that will be used in the next family of planning methods we will see: policy iteration. So the first family is linear programming, the second is value iteration, the third is policy iteration. I have to set up a little bit in order to explain what policy iteration does.

For a given policy pi, let I(pi) be the set of all states for which there exists an action whose Q-value from that state strictly exceeds the value of that state under the policy itself. Q^pi(s, pi(s)) is just V^pi(s), so s is inside this set if the action you are currently taking under pi can be improved upon, in the sense of Q, by some other action. It's complicated to say in math, and it's also complicated to say in English, because there are so many variables. Okay: if instead of pi(s) I took some action a from s for one step and then followed pi subsequently, then s belongs to I(pi) if for some a the former gives a strictly higher expected long-term return than the latter. All such states are included in the set I(pi).

As we just saw, Q^pi is easy to derive given a policy pi: you do policy evaluation, you get V, and you can also get Q. And once you know Q and V, you can run over all your states, do this comparison, and see for every state whether it belongs to I(pi) or not. So given a policy, in something like n cubed time, you can compute Q^pi and I(pi).

The key statement governing this entire family of thinking — policy iteration and the whole approach — is the following. If I(pi) is empty, then pi is necessarily an optimal policy. And the converse too: if pi is not an optimal policy, then I(pi) is not empty. What does that mean? You have some policy, you evaluate it, and you see there are states that can be improved. If there are states that can be improved, the policy is necessarily not optimal. On the other hand, if you evaluate your policy and there are no states on which it can be improved, the policy is necessarily optimal. This becomes the basis for designing algorithms in the policy iteration family, and this is how we will do it.

Let us assume we do not have an optimal policy: let pi be a policy such that I(pi) is not empty. Let us arbitrarily choose a non-empty subset of I(pi) and call it C(pi). I(pi) really is a property of your policy; it is determined. C(pi), on the other hand, is a quantity we choose — you can think of it as a random variable if you want, but we can choose it any way we like; it is some non-empty subset of I(pi). If I(pi) has three states, maybe we pick one of them and call that C(pi). Define now a policy pi prime as follows: pi prime of s is exactly pi of s if s does not belong to C(pi); if on the other hand s belongs to C(pi), then pi prime takes an action that maximizes Q^pi(s, a) over a. Let me say this in English. You have a policy pi, you evaluated it, and you found that I(pi) has, say, five states. Arbitrarily, you pick a non-empty subset of these five states; let's say we pick three of them.
Now we define pi prime as follows. In all the states except those three, we stick with pi. But in those three states, we take actions that maximize Q^pi(s, a). So pi prime is a policy derived from pi by evaluating pi and then choosing this subset.

The policy improvement theorem is essentially this: derive pi prime for any arbitrary choice of C(pi). For every such choice, it can be shown that for every state, the value of that state under pi prime is at least as good as the value of the same state under pi, and there exists at least one state in which pi prime is strictly better. This operation of going from pi to pi prime — evaluating and then switching — is called policy improvement. If you do improvement, pi prime will state-wise dominate pi: in every state it has a value that's at least as good, and in at least one state it is strictly better.

So maybe let me ask the audience a question at this point. Say we start with some policy pi, and I've given you this technique: policy improvement takes you to a policy pi prime that state-wise dominates pi. Does this suggest a way of planning — of finding the optimal policy? How would you apply it? Audience: use pi prime in place of pi, apply the same rule again, and keep doing it. Yes, keep doing it. It should strike you that by doing this repeatedly we will reach the optimal policy. Why? Start with any policy pi, do improvement, go to pi prime, evaluate pi prime; if pi prime can be improved, improve it to pi double prime, and keep going. Whichever policy you are at, if it is not optimal, it has a non-empty I(pi), so it can be improved. And because of the state-wise domination, you never revisit the same policy in this iteration — improvement strictly increases the value somewhere, so you cannot cycle. And you cannot keep going endlessly, because the number of policies is finite. In an MDP with n states and k actions per state, what is the total number of policies? k to the n, right? You can take any of k actions in each of your n states — k times k times k, n times — so k to the n. So the process has to terminate, and it cannot terminate at a suboptimal policy, because if a policy is not optimal, I(pi) is not empty.

This is a sound algorithm. And what do you think is the number of iterations — the trivial bound on the number of iterations this algorithm can take? The number of policies itself, k to the n. Good. This is exactly what we'll see on the next slide. This is an algorithm that was, I think, introduced in the 1960 textbook by Howard. You start with an arbitrary initial policy and keep repeating a loop: evaluate pi and find its improvable set.
Then you pick a subset from this improvable set and improve — do policy improvement — and you keep going until you hit an optimal policy, okay? Now, yes? Why leave the choice of subset open? That is essentially how it is in Howard's paper; the reason I set it up this way is that there are two or three different ways in which the choice can be made. The most canonical way is indeed to switch to improving actions in all of your improvable states. So when people say policy iteration, they normally mean greedy policy iteration: if you have five improvable states, you switch to improving actions in all five of them, so C(pi) is exactly I(pi). This is called Howard's policy iteration. And it was not until a 1999 paper — forty years later — that there was a bound on the number of iterations of Howard's policy iteration, shown to be of the order of k^n / n. So we've gone from k^n, the trivial bound, to k^n / n.

Now, remember that the policy improvement theorem holds for any choice of C(pi); it doesn't have to be exactly I(pi). So there's a randomized version of policy iteration by Mansour and Singh — we'll call it RPI in the remaining slides — where C(pi) is chosen uniformly at random from among the non-empty subsets of I(pi). Go back to my favourite example of having five improvable states: how many non-empty subsets can such a set have? Thirty-one, right? Two to the five subsets, of which one is invalid because it is empty. You choose uniformly at random from those thirty-one, and you keep doing this. That is Mansour and Singh's randomized policy iteration. Because it is a randomized algorithm, you get bounds in expectation. Mansour and Singh have a slightly improved analysis for two-action MDPs, k equal to 2, where they show it takes no more than about 2^{0.78 n} iterations. For general k, it is something like (k/2)^n, which is still much better than k^n / n.
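Putting the pieces together, here is a sketch of both variants. The helper names (improvable_states, improve) are mine, the numerical tolerance for "strictly exceeds" is an arbitrary guard, and the random-subset sampling is one straightforward way to get a uniform non-empty subset of I(pi).

```python
def improvable_states(V, Q, eps=1e-12):
    """I(pi): states where some action's Q-value strictly exceeds V^pi(s)."""
    return [s for s in range(len(V)) if Q[s].max() > V[s] + eps]

def improve(pi, Q, C):
    """Switch to a Q^pi-maximising action in the chosen states C; keep pi elsewhere."""
    pi_new = pi.copy()
    for s in C:
        pi_new[s] = int(np.argmax(Q[s]))
    return pi_new

def howard_pi(R, P, gamma):
    """Greedy (Howard's) policy iteration: improve in every improvable state."""
    pi = np.zeros(R.shape[0], dtype=int)
    while True:
        V, Q = policy_evaluation(pi, R, P, gamma)
        I = improvable_states(V, Q)
        if not I:
            return pi                 # I(pi) empty  =>  pi is optimal
        pi = improve(pi, Q, I)

def randomized_pi(R, P, gamma, rng):
    """Mansour-Singh style: improve in a uniformly random non-empty subset of I(pi)."""
    pi = np.zeros(R.shape[0], dtype=int)
    while True:
        V, Q = policy_evaluation(pi, R, P, gamma)
        I = improvable_states(V, Q)
        if not I:
            return pi
        while True:                   # uniform over the 2^|I| - 1 non-empty subsets
            mask = rng.integers(0, 2, size=len(I)).astype(bool)
            if mask.any():
                break
        C = [s for s, keep in zip(I, mask) if keep]
        pi = improve(pi, Q, C)
```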
So that's the number of iterations; what's the complexity of each iteration? It's n cubed or something like that — some polynomial in n and k. These are the references. And the thing to note now is that these bounds do not depend on B and gamma. The complexity of Howard's policy iteration is poly(n, k) times k^n / n; there is no dependence on B or gamma, because policy evaluation does not depend on B or gamma. If you assume you can do exact arithmetic regardless of the size of your representation, then the number of such operations can be bounded purely in terms of n and k, and that, multiplied by a polynomial in n and k, is the computational complexity of these algorithms. Good. This is also our interest: what we contribute is an algorithm that we intend to bound in terms of n and k alone, with no dependence on gamma or on the size of the representation.

What's hiding in the big-O? I wrote it in shorthand: Mansour and Singh showed the bound is no more than, I think, 6 times k^n / n, and Hollanders and colleagues later showed the same algorithm takes no more than a little over 1 times k^n / n. There's been a lot of effort even playing with the constants in these bounds. You would imagine k^n / n is basically exponential — and it is — but strictly speaking it's somewhat better than just k^n. This effort has taken a really long time because, although the algorithm is very simple, people have very little understanding or control of its structure — of how it actually traverses the space of policies — so bounding things becomes very difficult.

Question: any lower bounds? For policy iteration there is one, which I will describe on my future-work slide, and which has some problems for us. An MDP has been constructed on which Howard's policy iteration takes at least 2^n iterations, where n is the number of states. It turns out this MDP does not have a fixed number of actions per state — it is not, as in our definition, an MDP with k actions per state. It has a variable number of actions per state, and some of its states have as many as order n actions. So there is an exponential dependence on the number of states, but the MDP model is somewhat different from ours; that actually causes a problem for us, as we'll see. Is that a good point to pause? Yes?

Question: why is it desirable that the complexity not depend on gamma? Gamma is part of the problem, so it should appear; maybe you are getting something weak by taking the worst case over all gamma. Well, in practice you could argue that if a bound depends on gamma, it can be made arbitrarily bad by pushing gamma closer and closer to one. But I offer this mostly as a theoretical curiosity: there is a notion of strong algorithms, with strength defined as having no dependence either on B or on gamma. In practice, value iteration works just fine. And gamma is of course there in Bellman's equations; it determines what the solution is. The question is only whether the iteration bounds need to depend on it.

So, precisely, what have we contributed against this backdrop? Linear programming gives us poly(n, k, B). Value iteration additionally has the 1/(1 - gamma). Policy iteration has different variants with different analyses; the best we know of is Mansour and Singh's randomized version, which takes something like (k/2)^n iterations in expectation, multiplied by poly(n, k). We have given an algorithm called PGPI which improves this bound from (k/2)^n to (sqrt k)^n, that is, k^{n/2}. The bound remains exponential, except that rather than k/2 we have sqrt k in the base. It is a very simple randomized algorithm, which I will explain in some detail. The key construction we rely on for the analysis is a total order on the set of policies, which I will introduce, and as you will see, the analysis is very straightforward. We also show that by combining PGPI with randomized policy iteration you can improve on this slightly — you can do slightly better than k^{n/2}, especially for k equal to 2. It is also worth mentioning at this point — and this is something I am not too certain about — that strong bounds of this kind can apparently be derived for linear programming as well.
Take simplex, for example. It has been shown that the standard, deterministic pivoting rules for simplex can take exponentially many steps. But randomized versions of simplex, where you pivot randomly — some of them have been shown to have sub-exponential complexity: the number of steps needed to hit the optimal solution is sub-exponential in the number of variables and constraints. Now, for us the number of variables and constraints are n and n times k, one way or the other. So it's not completely clear in my head whether those sub-exponential bounds directly give sub-exponential bounds in the MDP setting. Maybe they do, maybe they don't; the burden of proof is really on me to read the linear programming papers carefully and have this question resolved in the next version of the talk. Do you see what I'm saying? Our bound is exponential — it's k to the order n. If we had k to the square root of n, that would be sub-exponential. So do the sub-exponential bounds for linear programming imply similar bounds for MDP planning? It's not immediately obvious, because the number of variables and constraints is not n and k; it's n and nk, or nk and n. Some work is needed to sort this out. But you'll already start seeing, I hope, parallels between linear programming and MDP planning — in particular between simplex and policy iteration. They share this idea that you start where you are, look locally at how you can improve, and improve; and randomized pivoting looks very similar to randomized policy iteration. But at this point, let me describe our algorithm, unless there are questions.

Fine. This is PGPI. We are going to define a total order on the set of policies. For every policy pi, define the number V(pi) to be the sum of all the state values of pi: V^pi(s) is the expected long-term return you get by starting at s and following pi; add up V^pi(s) over all the states and call that scalar V(pi). Let capital L be an arbitrary tie-breaking rule — the lexicographic ordering of policies, or anything else you like; L is just some fixed total order on the policies. Now define a relation, call it "exceeds": for any two policies pi_1 and pi_2, pi_1 exceeds pi_2 if V(pi_1) > V(pi_2); and if V(pi_1) = V(pi_2), then pi_1 exceeds pi_2 if pi_1 comes above pi_2 under the tie-breaking rule L. If L is a total order, then so is "exceeds". Fine?

What this gives us — it seems trivial, but it gives us, in a very natural way — is a means of comparing any two policies. If you do policy iteration, the standard analysis at some point rests on a partial order: if you go from pi to pi prime by improvement, pi prime state-wise dominates pi, and in that sense pi prime is above pi. But you can certainly have two policies that do not state-wise dominate each other — one is better on some states, the other is better on the rest. Here, on the other hand, because we're adding up the values, we can compare any two policies using this total order. And note that the definition is not thoughtless; specifically, it is such that if we go from pi to pi prime by policy improvement, then pi prime also exceeds pi in the total order. Why? We just saw that under policy improvement, pi prime is at least as good in every state and strictly better in at least one. Add up all these inequalities and you get V(pi prime) > V(pi), and therefore pi prime exceeds pi.
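A sketch of this total order in code, reusing the earlier policy_evaluation helper; lexicographic comparison of the action vectors stands in for the arbitrary tie-breaking rule L.

```python
def scalar_value(pi, R, P, gamma):
    """V(pi): the sum of the state values of pi."""
    V, _ = policy_evaluation(pi, R, P, gamma)
    return V.sum()

def exceeds(pi1, pi2, R, P, gamma):
    """Total order: compare scalar values; break exact ties with a fixed rule L."""
    v1 = scalar_value(pi1, R, P, gamma)
    v2 = scalar_value(pi2, R, P, gamma)
    if v1 != v2:
        return v1 > v2
    return tuple(pi1) > tuple(pi2)   # lexicographic tie-break, one choice of L
```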
Okay, so what do we have? We have a total order on the policies which respects the partial order — if a policy state-wise dominates another, it also exceeds it in the total order — but in which, additionally, any two policies can be compared. This gives us a very easy way to design an algorithm for MDP planning.

Instead of thinking in the space of policies for a moment, let's zoom out and look at this. Here's a number line with the numbers 1 through capital N. I'd like somebody to volunteer to play this game. You are born at some point on this number line, and there are two things you can do. One is to increment: if you're at i, you go to i + 1. The other operation at your disposal is a guess: a number is picked uniformly at random between 1 and N — any of the numbers can show up, each with equal probability — and if you want, you can move yourself from i to that position; if you don't want to, you stay put. For example, you might be shown this number, with probability 1 over N, and move there; you might be shown 3 and move to 3. Your objective is to increment and guess, in any order you like, so as to reach the top — capital N — as quickly as possible. The question is how many operations you need. Don't worry too much about where you're initialized; take the worst case. How many guesses and increments do you need to reach capital N?

Audience: log N. How would you do that? If you guess, there's probability a half that the guess lands above the halfway point; whenever it lands below you, you stay put, and whenever it lands above you, you jump. But why would that not take long? On the first guess you can, with probability a half, jump into the top half; but after that only a smaller and smaller fraction of guesses improve you — a quarter, an eighth, and so on — and to land exactly on N a guess succeeds only with probability 1 over N, so I think it will still take long. Okay. So what's the trivial bound? N: just keep incrementing.
In a deterministic sense you will not take more than order N operations: just increment. But the hope is that by guessing you can make some dramatic jumps early on. If you're way down here, then with very high probability a guess lands somewhere above you, so you should not be incrementing. If, on the other hand, you're at N minus 1, would you want to guess? No. So it seems like lower down you should guess and higher up you should increment; that's the intuition, but where does the crossover happen, and how many operations are needed as a function of N? Log N is a very reasonable guess. Now, the way I've described the game, you actually know where you are; when we map this back to MDPs, we will not know. But it makes essentially no difference — the difference between knowing and not knowing your position is a constant factor. Even if you did not know where you are, there is a way of guessing and incrementing under which you are only a constant factor worse off than if you knew. So assume you know where you are.

Question: does the guess depend on where you are? No — the guess operation is: regardless of where you are, a number is chosen uniformly at random from 1 to N; if you want, you move to it, and if you don't, you stay put. Audience: square root of N should do it. That's the second guess at the answer. It's certainly going to be less than N, so it has to be something like square root of N, or log N, or N to some power less than one. And indeed square root of N is the best you can do. I'm not going to analyze it fully, but essentially this game sets up a Markov chain: from any of these points you can either guess or increment, and one of the two is the optimal thing to do, because you can define a reward of minus one for every operation you incur and look at the long-term value of a strategy from each point — I'm always tempted to call these points states, but I don't want to confuse them with the states from before, so let's say notches. You can show that the induced optimal policy has this form: from N down to some threshold, the right thing to do is to increment, and everywhere below the threshold, the right thing to do is to guess. The threshold turns out to sit at roughly N minus order square root of N, so order square root of N operations is roughly the best you can do.

So Supratik's question was whether we know where we are or not. In the algorithm we don't actually need to know where we are; it's enough to know whether a guess has taken us to a point above us or below us — we don't need the indices themselves. The algorithm is essentially: guess square root of N times, each time moving to the guess if it lies above you and staying put otherwise; having done that, just increment. It's possible to show that after square root of N guesses, with enough probability you are already in the top square root of N, and therefore you need no more than square root of N further increments. So here's what you do. You guess; you're given a point above you; you move there. You guess again; it falls below you; you stay put. Guess again, go up; guess again, go up; guess again, stay put. Once you've guessed enough times, you start incrementing. If you guess square root of N times you'll be in the green zone at the top, and then incrementing costs an additional square root of N. There are other ways of doing it, but square root of N is the fundamental quantity you'll hit: the expected number of operations for playing this game is order square root of N.
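Here is a small simulation of the guess-then-increment strategy, just to see the order-square-root-of-N behaviour; the strategy and the constants are my reading of the description above, not an exact replica of the analysis.

```python
def play_guess_then_increment(N_points, rng):
    """Guess about sqrt(N) times, keeping only guesses above us, then increment."""
    pos, ops = 1, 0
    for _ in range(int(np.sqrt(N_points))):
        ops += 1
        g = rng.integers(1, N_points + 1)    # uniform guess in {1, ..., N}
        if g > pos:
            pos = g                          # accept only if it moves us up
    while pos < N_points:                    # walk the rest of the way
        ops += 1
        pos += 1
    return ops

# e.g. np.mean([play_guess_then_increment(10_000, rng) for _ in range(100)])
# comes out around a small multiple of sqrt(10_000) = 100.
```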
So what is the connection between this game and the MDP planning problem? Each of the points is a policy. Arrange your policies in the total order: there is an optimal policy at the top, and the worst policy, in this sense, at the bottom. What does it mean to guess? You pick a policy uniformly at random from all your policies — that's very easy to do: go to every state and pick an action uniformly at random, and you get a policy drawn uniformly at random from the set of policies. You evaluate both your current policy and the one you have guessed; if the guess exceeds your current policy in the total order, you switch to it, otherwise you don't. You do this square root of capital N times, where capital N is now the total number of policies. From then on you just increment — in fact, by doing policy improvement you are guaranteed to go up by at least one notch, and in practice you might go up by more.

So this is our algorithm. Start with an arbitrary policy. Take alpha to be one half for the present. Repeat N to the alpha times — that is, the square root of the total number of policies: draw a policy uniformly at random; if it exceeds pi in the total order, switch to it, otherwise don't. Then, from wherever you have reached at the end of this guessing phase, do policy improvement until you reach an optimal policy. Essentially the analysis of the game translates to the analysis of this algorithm, and shows that it takes no more than order k^{n/2} policy evaluations in expectation — that's with alpha equal to one half. You can do slightly better by picking alpha to be 0.46, if you agree to use randomized policy iteration as your improvement step. Any policy improvement gives an increment of at least one, but a slightly tighter analysis is available if we use Mansour and Singh's randomized policy iteration. If we do that, alpha equal to 0.46 suffices for a bound of 2^{0.46 n} when k equals 2; you also get a tighter bound for general k, but it doesn't look nearly as neat as this.
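A sketch of PGPI as I understand it from this description, reusing the helpers from the earlier snippets (policy_evaluation, improvable_states, improve, exceeds). The improvement phase below uses greedy, Howard-style improvement; the talk notes that any valid policy improvement works, and that the slightly tighter bound needs Mansour and Singh's randomized improvement instead.

```python
def pgpi(R, P, gamma, rng, alpha=0.5):
    """Guess (number of policies)^alpha times, then do policy improvement."""
    n, k = R.shape
    num_policies = k ** n                           # only sensible for toy sizes
    pi = rng.integers(k, size=n)                    # arbitrary initial policy
    for _ in range(int(num_policies ** alpha)):     # guessing phase
        guess = rng.integers(k, size=n)             # uniformly random policy
        if exceeds(guess, pi, R, P, gamma):         # compare in the total order
            pi = guess
    while True:                                     # improvement phase
        V, Q = policy_evaluation(pi, R, P, gamma)
        I = improvable_states(V, Q)
        if not I:
            return pi
        pi = improve(pi, Q, I)                      # greedy improvement, one choice
```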
So what have we seen? We have this randomized game, for which we demonstrated that a simple randomized guessing strategy works, taking on the order of the square root of the number of points; we then translated it to MDP planning, where the number of points is the number of policies, and by doing this we get the bound we claimed. We can also do slightly better by applying randomized policy iteration.

Question: so the way to see it is that you select that many random policies and pick the best one, which with high probability belongs to the top? That's exactly it — the lemma in our paper is basically that if you guess capital N to the alpha times, then with enough probability you will be in the top N to the 1 minus alpha; choosing alpha to be one half gives the first result. Question: the only catch is the tie-breaking — are we somehow assuming something about it, and could the lexicographic comparison mess up the uniformity? I'm just wondering whether, if there were no lexicographic tie-breaker, the probability would work out cleanly, and whether some uniformity assumption on the tie-breaking has to come into play here. Answer: you need the tie-breaking only to be technically precise, because it is possible for different policies to have the same scalar value. In practice that might never happen, but if it does, then to have a well-defined ordering you do need a tie-break. It doesn't matter what it is — any ordering on those policies will do. The real progress is happening because of the policy improvement, not because of the tie-breaking; in fact there can be multiple optimal policies too, and we will find the one among them that also wins the tie-break.

So let me go to the conclusion. We introduced an algorithm which improves upon the known bounds in terms of n and k alone — bounds that are strong in the sense of having no dependence on the size of the representation or on the discount factor. It is a randomized algorithm, and I cannot help thinking that using policy iteration for MDP planning seems very analogous to using simplex for linear programming. So what is simplex doing?
You start at some vertex of your polytope, you look at your neighbours, and among those that improve your objective function you switch to one based on your pivoting rule, either deterministically or randomly, and you keep doing this. In the worst case, simplex with standard deterministic pivoting can take exponential time; if you randomize the pivoting, you can actually get sub-exponential time — that is what some papers from the 1990s, which I have not listed here, show for randomized simplex. In policy iteration we likewise see that Howard's rule can take a long time, but when you randomize you get better bounds. In both cases the bounds are purely in terms of the size of the combinatorial structure — the number of states and actions for the MDP, the number of constraints and variables for simplex — with no dependence on the size of the representation, unlike the interior point method.

It turns out that we are not getting very good experimental results with PGPI, for the reason that these bounds for policy iteration seem to be extremely conservative on most MDPs. If you draw a random MDP from a distribution — say rewards from a Gaussian and transitions from a Dirichlet, something like that — you end up with an MDP on which policy iteration converges in something like five iterations. It is remarkably fast, and it is not even well understood why policy iteration is that efficient. So our hope, presumably, lies in MDPs that are truly difficult for policy iteration. Very recently — about five years ago — somebody finally constructed an MDP on which Howard's policy iteration takes 2^n iterations. It turns out, though, as I mentioned, that this MDP is one in which the number of actions varies per state; you do not have a fixed number of actions per state, and therefore the number of policies is much, much larger than k^n or 2^n, because there are many states with as many as order n actions. On such MDPs, the square root of the total number of policies is itself something like n^n, or n^{n/2}, which is much bigger than 2^n. So we hypothesize that our method will help on problems of bounded width — problems with a small, constant number of actions per state — on which you can still show that policy iteration takes exponentially many iterations. And that is a well-defined open problem for you: is there an MDP with k actions per state and n states, and an initial policy, from which Howard's policy iteration takes exponentially many iterations — some c^n iterations, with c either a constant or a function of k — before it finds the optimal policy? We are not aware of such a construction. For that reason, in practice PGPI ends up doing a lot of wasteful guesses: if you just did policy iteration you would do so well that there is little point in making all those square-root-of-the-number-of-policies guesses. It is good in theory; it is not at all clear that it is good in practice on most practical MDPs.

These are references to the long history of this problem; some of these questions have been open for a long, long time, so it is genuinely surprising that a very simple method gives a significant improvement in the exponential bound. If anything, I think this opens up further possibilities — especially the connections with linear programming, which seem too tempting not to look into further.
So if there are students who would like to take a look at this, I'd be very happy to talk with you. With that, I thank you very much.

Question: about the linear order you construct — we assume you pick any tie-breaking rule you want, maybe just the lexicographic ordering of policies; we define a scalar for each policy, the sum of its state values, and combined with your tie-breaking rule that imposes a total order. Question: is there work relating this to the LP? There is some work on the link between policy iteration and the dual of the LP formulation. Correct — the dual has variables that look like stationary state-action frequencies, and doing simplex on the dual seems very close to what policy iteration is. So maybe a Klee-Minty style counter-example could give you a lower bound? I am not aware of one for this particular linear order, but there have been preliminary counter-examples on MDPs — this is published, I don't remember the citation — again only for the simplex approach, which is a different approach from ours, not this one. Question: for simplex on MDPs, isn't it known to be strongly polynomial? Only if you assume a constant discount factor or something like that; then there is no dependence on B, but there is a dependence on gamma. If you can get a strongly polynomial algorithm for MDP planning in general, and you are a PhD student, you can write your thesis on it. But yes, it is useful that doing simplex on the dual looks like policy iteration. Question: does that settle it? No, it does not settle it; I think we are going after a more targeted question — this is a very specific problem, and the whole of LP may not behave the same way. And correct, the dual will also not have the dependence on B — doing simplex in general does not depend on B, which is why you get a sub-exponential, rather than polynomial, number of iterations.

Question: at this level of abstraction you are looking at whole policies — either a one-step improvement, which you can compute, or a random point in the entire order. Could we look further inside the policies, so that the vertices are not policies at that abstraction but something more structured, say state-action pairs, and walk in that space? That is very good intuition; the only problem is that it is very difficult to intuit in this space — you can construct examples where doing what seems to be right in many states ends up being the wrong thing. This also ties in with the first guess of log n: if you could play that game in log n operations, you could plan in strongly polynomial time, and the way to do log n would be to generate guesses uniformly at random only from the policies that lie between you and the optimal policy — some way of guessing only policies that are already guaranteed, at least with high probability, to be better than you. Question: I was trying to say that a policy is a vector, and one could do a sort of bitwise improvement: if the first few bits of my policy are already right, then I should only guess on the remaining bits of its encoding — say the best policy is all ones and the worst is all zeros; is there a way to encode policies so that this bitwise search works? Well, if the best policy is all ones, the worst need not be all zeros; it could be all ones except the last bit, which is zero. So such an encoding cannot exist in general; maybe you would just be
pushing the difficulty under the encoding — your encoding would be doing all the magic.

Question: Sivram, your increment increments the value — it is not a move to i plus one on some scale? Correct: increment in our algorithm means improving your policy; it goes up somewhere in the order, it increments the value, the scalar associated with the policy. Question: so why do you need this other ordering at all? You could just pick root N policies at random, choose the best, and then keep incrementing the value; you will always stay within square root of N. You do not need the total order at all, is what I am saying — there is a partial order anyway, and improvement moves you up in the partial order. Answer: correct, but your guesses need not lie on a single chain of the partial order; the total order is what lets you say you are in the top square root of N. Question: but the top square root of N elements fall in some order anyway — order them however you want; you will certainly be above them, because their number is greater than square root of N, and then in the partial order you move up from there. So you do not need the total order — or rather, the partial order might give you a better handle, a better bound, than the total order, or at least the same: you are saying the longest path below you has length at most square root of N, and if you factored in the partial order you might actually be much higher. Answer: yes — if you actually knew something about the partial order, even fewer than square root of N guesses might do. But we make no assumption about it. And we do use the total order in a basic way: we take the best of your square root of N guesses — best under what definition? Suppose N is 9 and you guess three times; you might have three policies none of which state-wise dominates the others — they can be incomparable under state-wise domination — and you still have to pick one of them and call it your best. For that we need the total order: choose the one with the largest value, largest value as defined by our definition; that definition is what I am calling, synonymously, the total order. Question: okay, but don't look at V(pi), the scalar; look at V^pi, the vector — the value function of a policy is an n-dimensional vector, and you are collapsing all of it into one number. Answer: yes, collapsing it into one number is exactly
what the total order is. Question: and what is the analogue for the partial order — the longest chain in the partial order? Then you could use that in your bound. Correct, but it can be a big one. Good — no more questions means we will stop.