So, I will talk about a very nice result of Oliver Friedmann, Thomas Hansen, and Uri Zwick on lower bounds for randomized pivoting rules for the simplex algorithm. I should say these slides are by Thomas Hansen; I have only made small modifications. I will begin by defining the simplex algorithm: many of us have used it, but I still want to remind you what it is and to state the problem. The nice aspect of this paper is that it draws connections to so-called Markov decision processes, and we will see what these are and how the MDPs are used to derive the lower bounds.

So, a quick reminder: a linear program is an optimization problem with a linear objective function and a set of linear constraints. I can represent the constraints by a matrix A, where d is the dimension of the variables and A is an n-by-d matrix. It is easy to check that the set of feasible solutions to these inequalities forms a polytope, and there is always a vertex of this polytope which is an optimal solution. The vertices are also called basic feasible solutions, and it is well known that for any linear objective function the optimum is always attained at one of the vertices. So optimizing just means finding the best possible vertex.

Before we talk about the simplex algorithm, we first put the linear program in standard form. In standard form you have a linear objective function, a set of equality constraints Ax = b, and the only inequality constraints are the non-negativity constraints x >= 0. It is easy to check that any linear program can be put in this form. All you have to do is introduce slack variables: a set of new variables, call them y, and write Ax + y = b with y >= 0. And if you have variables x which are not required to be non-negative, you can replace each by two variables x+ and x-, write x as their difference, and require both to be non-negative. So any linear program can easily be put in this form, and this is the form of the LP we will work with in the rest of the talk.

Now, what is a vertex solution here? The number of variables is m = d + n, which is bigger than the number of equality constraints; think of A as having full row rank. With m variables, you need m tight constraints to pin down a vertex. n of these come from Ax = b, so you need m - n more from the inequality constraints; that is, m - n = d of the variables are set to 0, and only the remaining n variables can be non-zero. So a basis is defined as a subset of n columns which are linearly independent: the variables outside the basis are set to 0, and only the n variables which form the basis can be non-zero. They could still be 0, but in general they will not be.
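To make the slack-variable step concrete, here is a minimal sketch in Python; the matrix entries are illustrative, not taken from any slide:

```python
import numpy as np

# Toy LP in inequality form: maximize c^T x subject to A x <= b, x >= 0.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([2.0, 2.0, 3.0])
c = np.array([2.0, 1.0])

# Standard form: one slack variable per row turns A x <= b into
# [A | I] [x; y] = b with y >= 0; the objective pads zeros for the slacks.
n, d = A.shape
A_std = np.hstack([A, np.eye(n)])          # n x (d + n) = n x m
c_std = np.concatenate([c, np.zeros(n)])

# The all-slack basis (x = 0, y = b) is a basic feasible solution
# whenever b >= 0, which gives simplex a starting vertex.
print(A_std)
print(c_std)
```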
So that is what a vertex is. Now, how does the simplex algorithm behave? It starts at a vertex and moves to a neighboring vertex; that move is called pivoting. You start with a basic feasible solution, look among the neighboring vertices for one that is better than the current vertex, and move to it. In terms of the actual algebra: given a basic feasible solution, you take one basic variable out and bring one new variable into the basis. That is, you relax one tight constraint and make a new one tight, which is exactly moving along an edge of the polytope.

So let's see the picture. You start at some basic solution, and say there are three edges at this vertex. You move along one of them because the objective function improves in that direction. When you arrive at the next vertex, maybe two directions improve the objective; you pick one of them and move along it. At the next vertex you could move back, but since you always want to improve locally, you only ever move along a direction in which the objective function increases. Now, I have not proved that if you are at a vertex from which you cannot reach a better vertex, then you are at an optimum; that follows essentially because everything is linear. As long as you can move in an improving direction, you are improving the value of the solution; if you cannot, you are at an optimum.

So let's see a quick example. You are given an LP of this form; first you put it in standard form by introducing slack variables x4, x5, x6, so now there are 6 variables. A basic feasible solution needs 6 tight constraints; 3 come from the equalities, so it sets 3 of the variables to 0 and the other 3 form the basis. Say you pick the basis {x4, x5, x6}, which means x1, x2, x3 are set to 0. If I write the equality constraints in this form, it is easy to read off the values: x4 = 1, x5 = 2, x6 = 1. So the current solution is x1 = x2 = x3 = 0 and (x4, x5, x6) = (1, 2, 1); that is the current point.

Now look at the objective function; remember this is a maximization problem. The coefficient of x1 is 2, so if I slightly increase x1 the objective value only improves. So that is what I try to do: I increase x1, and I can keep increasing it as long as no inequality constraint is violated. What happens is that at some point x4 becomes 0; you stop there and get a new basis. So you start at the initial vertex with three options, and because you want to improve your solution, you always move along an improving edge.
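Here is a minimal sketch of how one pivot step reads off these numbers: the basic solution from the basis, and the ratio test that says where the entering variable must stop. The data is made up in the spirit of the example, not the exact LP on the slide:

```python
import numpy as np

def basic_solution(A, b, basis):
    """Set the non-basic variables to 0 and solve for the basic ones."""
    x = np.zeros(A.shape[1])
    x[basis] = np.linalg.solve(A[:, basis], b)
    return x

def ratio_test(A, b, basis, entering):
    """How far can the entering variable grow before a basic one hits 0?"""
    B = A[:, basis]
    x_B = np.linalg.solve(B, b)             # current basic values
    d = np.linalg.solve(B, A[:, entering])  # rate at which they shrink
    steps = [x_B[i] / d[i] for i in range(len(basis)) if d[i] > 1e-12]
    return min(steps)                       # first basic variable to hit 0

# Illustrative standard-form data with basis {x4, x5, x6} (indices 3, 4, 5).
A = np.array([[1.0, 0.0, 1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 0.0, 1.0, 0.0],
              [0.5, 1.0, 0.0, 0.0, 0.0, 1.0]])
b = np.array([1.0, 2.0, 1.0])
print(basic_solution(A, b, [3, 4, 5]))   # x4=1, x5=2, x6=1, rest 0
print(ratio_test(A, b, [3, 4, 5], 0))    # increasing x1 stops at 1: x4 hits 0
```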
So you go to the new basis {x1, x5, x6}, and looking at the objective function at this point you see that you can increase either x2 or x3; in both cases the solution improves. Which one to pick is again decided by what is called the pivoting rule, and that is where all the interesting things happen. So say I pick one: suppose I increase x2; then at some point x5 becomes 0, you have to stop there, and so on. You keep moving in this manner until you reach a point where, when you look at the objective function, all the coefficients are negative. If you reach such a point, you know you are at an optimal solution. Why? Right now your current solution sets x4, x5, x6 to 0; if you make any of them positive, you can only decrease the value. So you are at an optimal solution. That is how the method works.

And there is a theorem which says that, assuming the LP is feasible and not unbounded, one of these two things always happens: either every coefficient is negative and you are done, or some coefficient is positive and you can improve the solution. I am not talking about the degenerate cases that can happen when some of the coefficients are 0; in fact, none of the LPs we will write will have any degeneracy. So: if all coefficients are negative, you are at an optimal solution.

What are the choices to be made? As we said, when several non-basic variables have a positive coefficient in the objective function, as here where you can increase either x2 or x3, which of the two should I pick? And in case of a tie: suppose I am increasing x3 and two of the basic variables simultaneously become 0; which one do you take out of the basis? This is what the pivoting rule decides; essentially, it tells you along which edge you should move.

The first such natural rule was Dantzig's: given that you want to locally improve your solution as much as possible, why not pick the variable with the largest positive coefficient? Now, there were some problems with this rule; the main issue was that in degenerate cases it can lead to what is called cycling, situations where you keep pivoting forever without making progress. I will not go into that. The second rule is Bland's rule: among all the variables eligible to enter the basis, pick the one with the smallest index, and similarly for the variable leaving the basis. This rule is nice because it is guaranteed never to cycle; it always converges.

So then people said: fine, we have proved convergence, but how much time can these rules take?
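As a sketch of the two entering-variable rules just mentioned, applied to the reduced objective coefficients, under the maximization convention used here where any positive coefficient is an improving direction (the tolerance and numbers are mine):

```python
import numpy as np

def dantzig_entering(coeffs):
    """Dantzig: enter the non-basic variable with the largest positive
    objective coefficient; return None if none is positive (optimal)."""
    j = int(np.argmax(coeffs))
    return j if coeffs[j] > 1e-12 else None

def bland_entering(coeffs):
    """Bland: among variables with a positive coefficient, enter the one
    with the smallest index; this is the rule that never cycles."""
    for j, cj in enumerate(coeffs):
        if cj > 1e-12:
            return j
    return None

coeffs = np.array([0.0, 3.0, -1.0, 5.0])   # illustrative reduced costs
print(dantzig_entering(coeffs))  # 3: largest positive coefficient
print(bland_entering(coeffs))    # 1: smallest eligible index
```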
In 1972, Klee and Minty showed that Dantzig's rule, even assuming you never run into cycling, can take an exponential number of steps. The examples look like this: you want to maximize x_n subject to a set of linear constraints, and the polytope defined by these constraints looks like a cube, but a slightly deformed one. If epsilon is 0 it is exactly a cube; for a very tiny epsilon it still closely resembles a cube. So I can still index the vertices as (0, 0, 0), (1, 0, 0), and so on, even though those are not exactly the coordinates. Starting from (0, ..., 0), the optimal solution is of course at the top, since you want to maximize the last coordinate x_n, and you could get there in one step. But one can easily show that the largest-coefficient rule, starting from the bottom vertex, moves along a Hamiltonian path: think of the vertices and edges as a graph, and the rule visits essentially all the vertices of this graph, each step an improving one.

Subsequent to this, people proposed many other pivoting rules. Instead of the largest coefficient, look at the variable that gives the largest increase in the objective function; or the edge with the steepest slope; then of course Bland's rule; then the shadow vertex rule, which I will not go into, but which is the pivoting rule whose smoothed complexity was later shown to be polynomial. A whole collection of rules, and for all of these, people came up with examples on which simplex takes an exponential number of steps.

Now, what does this say about the diameter of the polytope? Remember the diameter of a polytope is the maximum, over pairs of vertices, of the distance between them in the vertex-edge graph. No matter what algorithm I use, as long as I walk along the edges of the polytope, the diameter is a lower bound: if the diameter is large, there is a starting vertex and an objective function that force many steps. There is a very famous conjecture, the Hirsch conjecture, which said that the diameter of any polytope with n facets in d dimensions is at most n - d. That would say the diameter is always small, so one could perhaps imagine an algorithm taking a small number of steps. Now, very recently a counterexample to this bound was given by Santos: a polytope whose diameter is (1 + epsilon)(n - d) for a small epsilon > 0. So the Hirsch conjecture is not true. But it is still possible that the diameter is, say, a small constant times n + d, or polynomial in n and d; such bounds could still hold.

Now, one natural pivoting rule would be: among all the edges that improve my solution, why not take a random one? This is the randomized pivoting rule called the random edge rule, and it was a big open problem for the last 30 years to analyze it. In this paper the authors show that even this randomized rule can lead to an exponential number of steps. However, this says nothing about the diameter of the polytope: in fact, the polytopes we will see here have very small diameter, linear in the number of variables.
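Going back to the Klee-Minty construction above: for concreteness, here is one standard variant of that cube. There are several equivalent formulations, so treat the exact inequalities as an assumption about the form used on the slide:

```python
import numpy as np

def klee_minty(n, eps=0.1):
    """Return (A, b, c) with: maximize c^T x s.t. A x <= b, x >= 0.
    Constraints: x_1 <= 1 and, for i = 2..n,
        eps * x_{i-1} <= x_i <= 1 - eps * x_{i-1}.
    For eps = 0 this is the unit cube; for small eps, a deformed cube
    on which Dantzig's rule visits all 2^n vertices."""
    rows, rhs = [], []
    first = np.zeros(n); first[0] = 1.0
    rows.append(first); rhs.append(1.0)        # x_1 <= 1
    for i in range(1, n):
        upper = np.zeros(n); upper[i] = 1.0; upper[i - 1] = eps
        rows.append(upper); rhs.append(1.0)    # x_i + eps*x_{i-1} <= 1
        lower = np.zeros(n); lower[i] = -1.0; lower[i - 1] = eps
        rows.append(lower); rhs.append(0.0)    # eps*x_{i-1} - x_i <= 0
    obj = np.zeros(n); obj[-1] = 1.0           # maximize the last coordinate
    return np.array(rows), np.array(rhs), obj

A, b, c = klee_minty(3)
print(A, b, c, sep="\n")
```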
So this is the pivoting rule: you pick a random non-basic variable with a positive coefficient. Similar to the random edge rule, there is another rule called the random facet rule. It says the following: when you are at a vertex, it lies on several facets; you pick a random facet and recursively solve the problem restricted to that facet. When the recursion brings you to the optimum within the facet, the edge that takes you out of the facet, if it is improving, is the one you try to take. For this random facet rule, Kalai and others had shown an upper bound of a subexponential number of steps, something less than 2^n, and it was an open problem to give a lower bound for this rule.

The main results of this paper are that the random edge rule requires on the order of 2^(n^(1/4)) pivoting steps in expectation, and similarly that the random facet rule requires a subexponential number of steps. So even the randomized pivoting rules do not help here. And the nice thing about this paper is that all the previous lower bounds for pivoting rules were based on constructions that deformed cubes or other simple polytopes, because it is hard to reason directly about very complex polytopes. By going via Markov decision processes, as we will see, the polytopes constructed here are very complex, yet we can still talk about them. There is a well-known algorithm for solving MDPs, the policy iteration algorithm, which we will discuss, and it turns out to be exactly the simplex algorithm running on a related LP. So by analyzing policy iteration we are saying something about the simplex algorithm. The same technique has been used in several other places; for example, there is Zadeh's least-entered pivoting rule. Unlike the other pivoting rules, this rule maintains a memory: it remembers how many times each variable has entered the basis, and when you come to a new vertex you bring in the eligible non-basic variable that has been used the fewest times. Even such rules with memory have been handled by this technique; it is very powerful.

So now I will talk about Markov decision processes and show where the connection to linear programming comes from. What is a Markov chain? A Markov chain, as we all know, can be thought of as an n-by-n matrix, where n is the number of states and each row of the matrix sums to 1. P_ij tells you: if you are in state i, what is the probability of going to state j. So say you start at state 1; from state 1 you go to state 2, and at state 2 you have probability one half of going to either state 1 or state 3, and so on. Initially I am at state 1 with probability 1, and I can represent this as a vector; as I take steps, I maintain the vector representing the probability distribution at that step. So at step 3, say, I am at state 1 with probability 1/4, at state 2 with probability 1/2, and at state 4 with probability 1/4.
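This bookkeeping is a few lines of Python; the transition matrix below is a made-up example, not the chain on the slide:

```python
import numpy as np

# Illustrative 4-state chain; row i is the distribution of the next
# state given the current state is i (each row sums to 1).
P = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.0, 0.0, 0.0, 1.0]])   # last state is absorbing

dist = np.array([1.0, 0.0, 0.0, 0.0])  # start at state 1 with probability 1
for step in range(1, 4):
    dist = dist @ P                    # one step = right-multiply by P
    print(step, dist)                  # the distribution after each step
```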
And the way I can do this is: let b be your initial probability distribution, in this case (1, 0, 0, ...); as you take more steps, the next distribution is obtained by right-multiplying the current vector by P. So taking k steps corresponds to the vector b^T P^k.

Now let's talk about Markov chains which have rewards in them. This is exactly like a Markov chain except that each state carries a reward: whenever you leave a state, you collect that reward. So you have a vector c which tells you the reward for leaving each state i, and you would like to know, starting from a particular distribution, the total reward you accumulate as you keep moving in the chain. Remember that b^T P^k is the distribution at step k, so the inner product b^T P^k c is your expected reward at step k, and adding up over all values of k gives the total expected reward over time. This could be unbounded, so normally one of two things is done: either there is an absorbing state, which you reach with probability 1 and then stay in forever, which effectively truncates the series; or people add discount factors to make the sum converge. In our case, all the Markov chains we build will have a terminal state which is reached with probability 1. And the last thing: the value of a state i is the total expected reward when I start at that state.

So now I can define what a Markov decision process is. It looks like this: you have a set of states, here for example 3 states, and associated with each state i a set of actions A_i. So for example state 1 has two actions a1 and a2, state 2 has two actions a3 and a4, and so on. The idea is that when you come to a state, you take one of its actions. Say I take action a1 and get 7 units of reward; it is like a Markov chain with rewards. And then, exactly as in a Markov chain, the action gives me a probability distribution on the next state: say from state 1 under action a1 I go to state 2 with probability one half or state 3 with probability one half. So more formally: you have states, and with each state a set of actions; each action has an associated reward, which could be a negative number, and a probability distribution that tells me which state I go to next. And the point is that you would like to decide on a strategy for each state: if I am at a state, which of its actions should I take? For example, this policy tells me that at state 1 I take action a1, at state 2 I take action a4, and so on. Once you fix this strategy there is no indecision at any state; you know exactly which action to take, so it becomes exactly a Markov chain with rewards.
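Going back to the value of a state: since all our chains reach the terminal state with probability 1, the values can be computed by one linear solve. A sketch, restricting P to the transient states, so that rows may sum to less than 1 and the missing mass is the jump to the terminal:

```python
import numpy as np

def state_values(P, c):
    """v[i] = expected total reward starting from transient state i.
    From v = c + P v we get (I - P) v = c, and I - P is invertible
    because the terminal state is reached with probability 1."""
    return np.linalg.solve(np.eye(len(c)) - P, c)

# Illustrative 2-state chain: state 0 pays 1 per departure and repeats
# itself with probability 1/2 or moves to state 1; state 1 pays 2 and
# always goes straight to the terminal.
P = np.array([[0.5, 0.5],
              [0.0, 0.0]])
c = np.array([1.0, 2.0])
print(state_values(P, c))   # [4.0, 2.0]: v1 = 2, v0 = 1 + 0.5*v0 + 0.5*v1
```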
So that is what a policy is: it tells you which action to take at each state, and once you fix the policy it becomes a Markov chain with rewards. So you can again define the value of a state with respect to a policy. A policy pi* is optimal if it maximizes the values of all of the states: for every other policy pi and every state i, the value of i under pi* is at least its value under pi. Such a policy I will call an optimal policy. It is a non-trivial fact that an optimal policy always exists: a single policy which simultaneously maximizes the expected value of all the states. The goal is to find this policy. One can also show that randomized policies will not help you: if there is an optimal policy which is randomized, there is also a deterministic one which is just as good, so you might as well focus on deterministic policies.

Now, before we go on, a little notation. Instead of representing the Markov decision process in that manner, I will use a more compact representation, a graphical representation. The way it works is the following. You have a set of states, which I denote by circles, so there are four states here; then you have diamond-shaped vertices which tell me what the rewards are. And the third kind of vertex is what is called a randomization vertex: all it does is, when you come to it, give you a probability distribution over the next vertex. So if you come to this one, it takes you here with probability one third and there with probability two thirds. The only difference from the earlier definition is that a randomization vertex can be part of the graph and shared by several states. How do you move here? Say you start from this state: you can either go to the terminal t directly or come to the randomization vertex, which takes you one way with probability one third and the other way with probability two thirds, and in this way you move through the graph. So: if you are at a state, you take an action, that is, one of the outgoing edges; if you are at a randomization vertex, you move to a new vertex according to the distribution; and if you are at a reward vertex, which has exactly one outgoing edge, you collect the reward and go to the next vertex. It is just a compact way of representing an MDP; you could write out the MDP corresponding to this picture, it would just be a bigger object.

And again, a policy is a set of outgoing edges, one outgoing edge for every state vertex. Once you fix a policy, you can figure out the value of each state. For example, at this state you collect zero reward; at that state, with probability two thirds you collect no reward and with probability one third you collect six units, so the value there is two. So you can easily compute the value of each of the states. The goal is to find a policy which maximizes the sum of the values: since we know there is a single policy optimal for every state simultaneously, it is also optimal for the sum of the values.
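Once a policy is fixed, the MDP collapses to exactly the kind of chain we just solved, so evaluating a policy is one linear solve. A sketch, with a hypothetical encoding of my own where actions[i] is the list of (reward, transition-row) pairs available at state i:

```python
import numpy as np

def state_values(P, c):
    """Total expected reward from each transient state, as before."""
    return np.linalg.solve(np.eye(len(c)) - P, c)

def policy_values(actions, policy):
    """policy[i] selects one action per state; the result is a Markov
    chain with rewards, whose values come from the linear solve."""
    n = len(actions)
    P = np.zeros((n, n))
    c = np.zeros(n)
    for i in range(n):
        reward, row = actions[i][policy[i]]
        c[i] = reward
        P[i, :] = row
    return state_values(P, c)

# Toy 2-state MDP (plus implicit terminal), two actions per state.
actions = [
    [(0.0, np.array([0.0, 0.0])),   # state 0, action 0: quit, reward 0
     (2.0, np.array([2/3, 0.0]))],  # state 0, action 1: expected reward 2
                                    #   per try, retry with probability 2/3
    [(-4.0, np.array([1.0, 0.0])),  # state 1, action 0: pay 4, go to state 0
     (-1.0, np.array([0.0, 0.0]))], # state 1, action 1: pay 1, quit
]
print(policy_values(actions, [1, 0]))   # [6.0, 2.0]
```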
So that is what we want to find. Let's think about it the following way. Suppose your initial distribution is the uniform distribution; I will just write it as the all-ones vector, pulling out the scaling factor. At the next step you come to a new distribution, at the step after that another, and so on. What is the value of a state? Up to this bookkeeping, it is the sum of that state's entries across all the step vectors, weighted by the rewards. In other words, let x_a be the expected number of times a particular action a is used. If action a is the one chosen out of state i, then x_a is just the sum, over all steps, of the probability mass at state i. So the total value of all the states can be written as a sum over states i of c_a times x_a, where a is the one action the policy chooses at i: the benefit you get out of that action times the number of times you use it.

So let's write a linear program. This is the objective function: you want to maximize the total value you accumulate, and x_a represents how many times you use action a. Now, if this is a state, with these outgoing edges and these incoming edges, the claim is that the number of times you take the outgoing edges equals 1 plus the number of times you enter the state. Why, and where is this 1 coming from? It comes from the initial distribution: you initially give a charge of 1 to every state. After that, whenever you enter a state you also leave it, so it is like a flow conservation constraint: the outflow at a vertex is 1 plus the inflow. The outflow is the sum of the x_a values over all actions leaving the state, and the inflow is this quantity: for every action anywhere, the probability that it leads into this state times its x value, summed up.

So let's write the LP for this MDP; that will make it clear. Let's not worry about the actual policy right now. How many variables will I have? One variable for every action: two for this state, two for that one, and two more, so six variables. What is the objective function? If I take the action corresponding to x1, each time I take it I have a one-third probability of getting six units of reward and a two-thirds probability of going around again; so this action gives me an expected benefit of two per use, which is why the coefficient is 2. Similarly, x2 takes me directly to the sink, so it gives zero reward. For x3: it takes me to the other state at a reward of minus four, and from there I know I will get a reward of two in expectation, so the net coefficient is minus two. For x5 it is exactly the same thing: minus four units of reward and then plus two more, so minus two. And x6 always gives minus one unit of reward and then moves to a new state, which nets out to minus two as well. So I can write the objective in this form. Next come the constraints.
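Jumping ahead a little, here is a minimal sketch of assembling this whole LP, objective and flow constraints, mechanically. The MDP is the small toy from before, not the one on the slides, and the action encoding, a (home state, expected immediate reward, transition row) triple, is my own; scipy's linprog minimizes, hence the sign flip:

```python
import numpy as np
from scipy.optimize import linprog

def mdp_flow_lp(actions, n_states):
    """actions: list of (state, expected_reward, transition_row).
    Variables x_a >= 0, one per action. Constraint per state i:
        sum of x_a over actions leaving i
          = 1 + sum over all actions a of row_a[i] * x_a."""
    m = len(actions)
    obj = np.array([r for (_, r, _) in actions])
    A_eq = np.zeros((n_states, m))
    for a, (s, _, row) in enumerate(actions):
        A_eq[:, a] -= row        # inflow that action a sends to each state
        A_eq[s, a] += 1.0        # outflow from a's home state
    b_eq = np.ones(n_states)
    # linprog minimizes, so negate the objective to maximize total reward.
    return linprog(-obj, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * m)

actions = [
    (0, 0.0, np.array([0.0, 0.0])),
    (0, 2.0, np.array([2/3, 0.0])),
    (1, -4.0, np.array([1.0, 0.0])),
    (1, -1.0, np.array([0.0, 0.0])),
]
res = mdp_flow_lp(actions, 2)
print(res.x)   # one nonzero x_a per state at the optimum: a policy
```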
So look at, say, this vertex: the possible actions are x1 and x2, so x1 plus x2 equals 1 plus the total incoming flow. What is the incoming flow? Each time I take action x1 there is a two-thirds chance of coming back, so that contributes two-thirds x1; similarly, whenever you take x3 there is a two-thirds chance of arriving here, so that is two-thirds x3, and likewise for x5. So that is the constraint corresponding to this state. For the next state, the outgoing flow is x3 plus x4, and it equals 1 plus x6: if you take action x6 you definitely end up here, and otherwise you do not. Similarly for the last state, x5 plus x6 equals 1; there is no incoming flow. So that is the LP I can write down.

Now the question is: what is a basic feasible solution of this LP? I claim that a basic feasible solution is exactly a policy. Remember, how many non-zero variables can I have? If n is the number of states, it is easy to check that there are exactly n non-zero variables, one per state. For each state's constraint, at least one of its variables must be non-zero; why? Because of the 1 on the right-hand side: they cannot all be zero. So exactly one is non-zero per state, and since there are n states, the basis consists of exactly one action per state, that is, a policy. And the converse is also true: every policy corresponds to a basic feasible solution. So this is an exact correspondence.

And what does the simplex algorithm do? It moves from the current basic solution to a different basic solution, changing just one basic variable. So essentially the simplex algorithm corresponds to the following: look at a state and ask, if this is its current action, can I improve the solution by changing it to a different action, with every other state's action staying as it is? That is what a simplex step is. For example, here, if I change this state's action from this edge to the other one, the value of the state goes up from zero, so that is an improving step. So that is what the simplex algorithm is.

The nice thing is that instead of thinking about polytopes and moving along edges, I can think in terms of this local kind of algorithm: you have a policy, you change one action in the policy, and you check whether that improves the solution; if it does, you take it. Now, for MDPs you can also improve several actions simultaneously: I could change this action and that one at the same time, two or three actions at once, but we will not go into that. So this is the algorithm: while there is an improving switch, given a policy pi, update pi by performing a switch, where by switch I mean taking the action at some state and changing it to another action of that state. And the simplex algorithm is the special case of this algorithm where you perform only one improving switch at a time.
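A sketch of this local algorithm, using the same (reward, transition-row) action encoding as in the earlier sketches; the tolerance and the toy data are mine:

```python
import numpy as np

def policy_values(actions, policy):
    """Evaluate a fixed policy: solve (I - P) v = c, as before."""
    n = len(actions)
    P = np.zeros((n, n)); c = np.zeros(n)
    for i in range(n):
        c[i], P[i] = actions[i][policy[i]]
    return np.linalg.solve(np.eye(n) - P, c)

def improving_switches(actions, policy, v):
    """Action a at state i is an improving switch iff its one-step
    lookahead, reward + row . v, beats the current value v[i]."""
    return [(i, a)
            for i, acts in enumerate(actions)
            for a, (reward, row) in enumerate(acts)
            if a != policy[i] and reward + row @ v > v[i] + 1e-9]

def policy_iteration(actions, policy, rng=None):
    """Single-switch policy iteration = simplex on the flow LP.
    rng=None takes the first eligible switch (Bland-like); passing a
    random.Random instead gives exactly the random edge rule."""
    while True:
        v = policy_values(actions, policy)
        sw = improving_switches(actions, policy, v)
        if not sw:
            return policy, v
        i, a = sw[0] if rng is None else sw[rng.randrange(len(sw))]
        policy[i] = a

actions = [
    [(0.0, np.array([0.0, 0.0])), (2.0, np.array([2/3, 0.0]))],
    [(-4.0, np.array([1.0, 0.0])), (-1.0, np.array([0.0, 0.0]))],
]
print(policy_iteration(actions, [0, 1]))  # ends at policy [1, 0], values [6, 2]
```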
This is the well-known policy iteration algorithm, and the simplex algorithm corresponds to it exactly, provided you improve just one action at a time; so this is all we need to analyze now. Let's look at the example. Here is the LP corresponding to this policy, the basic solution, written in that form. Say you notice that the coefficient of x1 is positive; you try to increase x1, and that corresponds to changing this action to the new one. Then again you see that switching this other action improves the solution, so you take it, and so on. You keep moving until you come to a solution which is optimal. The theorem is: if you cannot find a locally improving action, you are at an optimal solution, the same as in the simplex algorithm.

So now I will briefly talk about how to use this for proving lower bounds for the simplex algorithm. Remember what the random edge rule corresponds to: among all the actions which can improve your solution, just pick one at random. What I will do first is show the lower bound for a special case where we do not use randomization at all, just Bland's rule; once we see that, we will see how to add randomization.

Just some notation. Remember these diamond vertices tell me how much reward there is, and all the rewards are exponentially increasing in magnitude. If I say the reward is five, it means it is minus some large number raised to the power five. The idea is that if you can get the reward labelled six, all the lower-labelled rewards are meaningless in comparison. And because of the negative sign, the odd labels give negative rewards and the even labels give positive rewards. So you would like to reach a reward with as high an even label as possible, and once you get it, all the lower ones are meaningless, because they are exponentially increasing in nature.

So how do we account for the exponential number of steps in the lower bound? The way I will encode it is the following. At each state there are several actions you could take; in our construction, each state will have either one or two actions available. When a state has two actions, I number them 0 and 1, and a policy then just tells me, for each such state, whether to take the 0-action or the 1-action. What I would like is to take the algorithm through a sequence of policies corresponding to counting through all 2^n n-bit numbers. So focus on the vertices labelled B here; each has two actions, 0 and 1, and we will go through the details in a minute. Say both take the 0-action: this corresponds to the number 00. Then, because of the rest of the construction, the next policy you reach will be 01, action 1 at the low bit and action 0 at the high bit, and the policy after that will be one plus the current setting.
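About the reward encoding just mentioned: it can be summarized in one line. A sketch; the base N is an assumption here, the paper tunes it, and all that matters is that it dominates the sum of all lower-priority rewards:

```python
def priority_reward(k, N=10**4):
    """Reward for a vertex labelled k: (-N)^k. Even labels give huge
    positive rewards, odd labels huge negative ones, and label k+1
    dominates every combination of labels <= k."""
    return (-N) ** k

print([priority_reward(k) for k in range(1, 5)])
# [-10000, 100000000, -1000000000000, 10000000000000000]
```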
So the algorithm is taken through all 2^n policies, starting from the all-zero policy. Here there are just two such vertices, B1 and B2; in general there will be n of them, and I will show that the simplex algorithm goes through a sequence of 2^n steps where, at the i-th step, the setting of these n bits is the binary representation of i, and at the next step it is that plus one. So if you are at 00 now, at the next step you are at 01, then at 10, and then at 11: all 2^n settings of the bits.

So let's see why this happens. Suppose this is your current policy; the policy tells you which of the two actions each vertex takes. Say this vertex takes the 0-action, that vertex takes the 0-action, and so on. There are two levels here; in general there will be n levels, each looking like this, and I have just drawn two. If the state of a level is 0, it means everybody in it takes the 0-action, and similarly for the other level. Now, how can I improve the solution? Look at the state B1: right now it goes directly to t and gets zero benefit. If it takes its 1-action, it gets six units of benefit and then goes to t; that is an improving step for it. Similarly, if B2 takes its 1-action, that gives it ten units of benefit. All other states gain nothing from switching. Remember the current state is 00 and we want to go to 01, because 00 plus one is 01. I can pick either of these two improving steps; say I pick the lower one, so the low bit becomes 1. Now, once you do this, another switch becomes improving: the vertex u1 realizes that it is currently getting zero profit, but routing through the six is better. So all of the lower level switches, and the state becomes 01.

Now what happens at the next step? From 01 you want to go to 10, because 01 plus one is 10. What happens is that B2 again realizes it can get ten units just by switching from 0 to 1, so it does, and the corresponding level switches as in the previous step. The nice thing, however, is what B1 does now. It realizes that it is currently getting six, but ten is much bigger than six; so now that B2's path leads to the ten, B1 can take its 0-action and go collect that larger amount instead. So that is what it does: it takes the 0-action, and the state becomes 10. I am not showing you all the details of this construction.
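The intended trajectory is just binary counting. A toy sketch of the sequence of bit settings the policies are forced through; this simulates only the counter, not the gadget dynamics that enforce it:

```python
def counting_trajectory(n):
    """Yield the 2^n bit settings (low bit first) in increment order,
    mirroring the sequence of policies the construction forces."""
    bits = [0] * n
    while True:
        yield tuple(bits)
        i = 0
        while i < n and bits[i] == 1:
            bits[i] = 0            # like B1 switching back to its 0-action
            i += 1
        if i == n:
            return                 # wrapped around: all 2^n settings seen
        bits[i] = 1                # like B_i grabbing the bigger reward

print(list(counting_trajectory(2)))   # [(0,0), (1,0), (0,1), (1,1)]
```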
But the idea is that you design a gadget which takes you through all of these states, where a state corresponds to a choice of action at each vertex, and you cycle through all 2^n of them. The one thing to notice, even if you did not follow all the details, is what this requires: when there are several possible improving switches, you must pick the one that suits the construction. For example, in the previous step, to go from 00 to 01 you would rather switch the low bit from 0 to 1 than the high bit. This preference is exactly what Bland's rule gives you: if you just follow Bland's rule, you will see that this is the order in which the switches happen. But what about the random edge rule? It picks either of the two switches with equal probability, and if it picks the wrong one, the counting sequence breaks.

So for the randomized rule we need to add more gadgets. The idea, and I will skip the details, is to replace each of these vertices by a gadget that delays the switch from one state to the other. How does it delay? It adds a cycle: for the switch to happen, several individual switches must happen. So think of replacing the vertex by a long path here; to move from this policy to that policy, all of these blue edges must switch to point upwards. Now remember that two different switches were competing with each other, and you want one of them to happen before the other. How do you build that? There are two cycles, one corresponding to each vertex, and if you want this one to fire first, you make its cycle shorter than the other. So we design a long path, require all of its edges to point upwards before the switch fires, and give different vertices paths of different lengths. It is then easy to check that if there are two chains, one of length L_i and the other of length L_{i+1}, and you want the shorter one to close before the longer one, this can be arranged with high probability provided the chain lengths grow at an appropriate ratio; there is a calculation that tells you what that ratio should be, which we will not go into.

So the construction looks exactly as before, but instead of a single vertex there is a chain, which adds a random delay to the switching; and by giving the chains different lengths, you can make the event you want happen before the other one, with high probability. There is a long chain here, an even longer chain there, and a longer one still, and so on. Let's not go into the details here.
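The claim that the shorter chain closes first is easy to sanity-check by simulation. A toy sketch; it abstracts the random edge rule as a uniform choice among the still-open edges of the two chains, ignoring the rest of the construction:

```python
import random

def shorter_closes_first(len_a, len_b, trials=100000, seed=0):
    """Empirical probability that a chain with len_a open edges closes
    before one with len_b, when each step flips one open edge chosen
    uniformly at random across both chains."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        a, b = len_a, len_b
        while a > 0 and b > 0:
            if rng.randrange(a + b) < a:
                a -= 1
            else:
                b -= 1
        wins += (a == 0)
    return wins / trials

print(shorter_closes_first(3, 6))   # well above 1/2, and grows with the gap
```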
One more issue, however: say you are building up a chain and you want all of its edges to point upwards, but one of them does and the other two do not. In that case you would like to reset the chain, bringing it back to the zero state, where every edge points in the forward direction. For that, another gadget is needed, which I do not have time for. So this is how the final construction looks. But I guess the main point is that the connection with MDPs allows you to work with very complex polytopes, which would not have been possible if you were just looking at cubes or modifications of cubes: you can encode policies as binary strings, and you can design the set of actions so that the algorithm is forced through the whole sequence. This is the final theorem they prove. I have just one slide showing the actual linear program: it is a pretty complex LP, and it would have been hard to come up with it directly. But because one can go through the MDPs, one never really has to think about the polytope directly at all.

There remain some very big open problems. Can one give a strongly polynomial time algorithm for solving MDPs? Of course, we do not know a strongly polynomial time algorithm for solving general LPs, but MDPs are a special kind of LP, so that is a big open problem. The other problem, of course, is to prove or disprove a polynomial bound on the diameter of polytopes, the polynomial version of the Hirsch conjecture.