I am going to talk about the expected window mean-payoff; this is joint work with Benjamin Bordais and Jean-François Raskin.

I start with the notion of classical mean-payoff. Given an infinite sequence of values, the mean-payoff of the sequence is the limit of the average of the first n values as n tends to infinity. The problem that mean-payoff suffers from is that it does not guarantee local stability: even if the mean-payoff value is some lambda, it is possible that over arbitrarily long infixes the average is actually far away from lambda. That is why we have the notion of window mean-payoff, which is a stronger notion than classical mean-payoff. Here we consider a window sliding over the infinite sequence, and for each of these local windows we want a given threshold to be satisfied. This strengthens the classical mean-payoff objective in the following sense: if the window mean-payoff objective is satisfied, that is, if for every sliding window the value is at least some lambda, then the classical mean-payoff is also at least lambda.

To give an example, the good window property is what I just mentioned: for every window of length at most l, the threshold lambda must be reached. So here is a sequence of weights, and given the window length l, we ask at each position whether for some length k ≤ l the average over the next k values achieves lambda. For example, with l = 3 we look at windows of length 1, 2, or 3 and, at each position, take the maximum average we can achieve. At the first position, a window of length 1 gives 3; length 2 again gives 3, because (3 + 3)/2 = 3; and the window of length 3, which is our l here, gives the maximum value, 11/3. At the second position I get the value 4, but already with k = 2, that is, with a window shorter than l. So this is the definition of our window mean-payoff: given l, I want to achieve the value lambda within l steps, not exactly at l steps but within l steps.

For a practical example, think of a bank account: there may be a requirement that the account balance must not stay below a certain amount for too long. So it is not enough that the balance is high over some arbitrary period; within every year, say, within every window of a certain length, you must maintain this particular balance.
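To make the per-position window value concrete, here is a minimal Python sketch; the weight prefix [3, 3, 5] is taken from the example above (the slide's full sequence is not shown here, so only the first two positions are checked).

```python
def window_value(w, i, l):
    """Best average over a window of length 1..l starting at position i.

    Position i is 'good' for threshold lam iff window_value(w, i, l) >= lam,
    i.e. some window of length at most l starting at i has average >= lam.
    """
    max_k = min(l, len(w) - i)  # stay inside the known prefix
    return max(sum(w[i:i + k]) / k for k in range(1, max_k + 1))

w = [3, 3, 5]                 # visible prefix of the talk's example, l = 3
print(window_value(w, 0, 3))  # 3.666... = 11/3, needs the full window 3,3,5
print(window_value(w, 1, 3))  # 4.0, achieved already with k = 2: (3+5)/2
```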
We look at different varieties of this window mean-payoff function. Given an infinite sequence, we first look at the fixed window mean-payoff function: the window length l is given, and I compute the value obtained for this sequence of payoffs, which I essentially have to ensure at every position. The fixed window mean-payoff is a prefix-independent property, meaning that beyond some position k, I should see this value at every later position. The bounded window mean-payoff is again a prefix-independent version, but the length l is not given; instead we ask whether there exists a window length l such that we can ensure a value of lambda. We also look at direct window objectives, which are not prefix independent: there you have to ensure the window mean-payoff value from the very beginning of the sequence. So, analogous to the fixed window mean-payoff, we have the direct fixed window mean-payoff, and likewise the direct bounded window mean-payoff objective.

Some properties are immediate. If l increases, the window mean-payoff we can ensure does not decrease: the value for window length l + 1 is at least the value for l, because I want to reach lambda within l steps, and anything I can ensure with a smaller window I can still ensure with a larger one. And the bounded window mean-payoff, where we ask whether a suitable l exists, is essentially the supremum over l of the fixed window mean-payoff values.
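In symbols, for an infinite weight sequence w_0 w_1 w_2 ..., the functions discussed so far can be written as follows; this is a reconstruction from the definitions in the talk, with indexing conventions in the style of the window objectives of Chatterjee et al., so take the exact notation as my assumption:

```latex
% Classical mean-payoff:
\mathsf{MP}(w) \;=\; \liminf_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} w_i

% Fixed window mean-payoff for window length l (prefix independent):
\mathsf{FWMP}_l(w) \;=\; \liminf_{i \to \infty} \;\max_{1 \le k \le l}\; \frac{1}{k} \sum_{j=i}^{i+k-1} w_j

% Direct variant: the prefix matters, so the infimum ranges over all positions:
\mathsf{DFW}_l(w) \;=\; \inf_{i \ge 0} \;\max_{1 \le k \le l}\; \frac{1}{k} \sum_{j=i}^{i+k-1} w_j

% Monotonicity in l, and the bounded variant as a supremum:
\mathsf{FWMP}_{l+1}(w) \;\ge\; \mathsf{FWMP}_l(w), \qquad
\mathsf{BWMP}(w) \;=\; \sup_{l \ge 1} \mathsf{FWMP}_l(w)
```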
We are going to study these functions over Markov decision processes. What is a Markov decision process? It is defined by a tuple where S is a set of states, E is a set of edges, and Act is the set of actions; in this diagram, for example, the actions are the brown action, the blue action, and the green action. Then s_init is the initial state, in this example s0; w is a function assigning a weight to each edge; and P is the probability function, mapping a state and an action to a distribution over the outgoing edges. For example, if I choose the brown action from s0, it assigns probability 0.5 to this edge and probability 0.5 to that edge.

What is an end component? An end component is a sub-MDP, itself an MDP, that is strongly connected; and a maximal end component (MEC) is one that is not included in any other end component. It is well known that every infinite path in an MDP almost surely, that is, with probability 1, as Akshay described in the previous talk, ends up in one of the end components: if you consider any infinite path in an arbitrary MDP, which need not itself be an end component, then with probability 1 the path eventually stays inside some end component. This is something we will use later.

We will also talk about strategies in an MDP. A strategy can be seen as follows: given a sequence of states, that is, a history, it assigns a distribution over the set of actions such that the support of this distribution is a subset of the actions available at the last state. For example, if I see a sequence of states ending at s1, a particular strategy might choose this action with probability 0.4 and that one with probability 0.6. A strategy is deterministic if, for each history, it chooses one fixed action from those available at the last state. A memoryless deterministic strategy is one that only looks at the last state, not the whole history. Similarly, we can talk about finite-memory strategies, which we will use here: a finite-memory strategy uses only a bounded abstraction of the history.

One property we will use again is that inside a MEC, from any state there exists a deterministic memoryless strategy to reach any other state of the MEC almost surely; for example, here you can reach from s0 to s6 by a deterministic memoryless strategy.

Now, a Markov chain is essentially what Akshay called a labelled Markov chain in the previous talk. The idea is that if I fix a strategy in an MDP, what I obtain is a Markov chain. For example, here I choose the brown action, then from s1 I again choose brown, and from here the blue one, and so on. Notice that when I come from s5, I have another state which is actually this s1, but which in some sense stores where I am coming from; it keeps track of the memory, so I call it s1', and it says that if I have come from s5, then instead of taking the brown action I choose the green action and go to s6. So the state space of the Markov chain also stores, in a sense, the memory of the strategy.
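To illustrate how fixing a strategy turns an MDP into a Markov chain, here is a small Python sketch; the state names, weights, and the memoryless strategy are hypothetical, not the ones on the slide.

```python
# An MDP fragment: (state, action) -> list of (successor, probability, weight).
mdp = {
    ("s0", "brown"): [("s1", 0.5, 2), ("s2", 0.5, 0)],
    ("s0", "blue"):  [("s3", 1.0, 1)],
    ("s1", "brown"): [("s0", 1.0, 3)],
    ("s2", "green"): [("s0", 1.0, 5)],
    ("s3", "blue"):  [("s0", 1.0, 0)],
}

# A memoryless deterministic strategy: one fixed action per state.
strategy = {"s0": "brown", "s1": "brown", "s2": "green", "s3": "blue"}

# Fixing the strategy leaves a Markov chain: state -> [(successor, prob, weight)].
# A finite-memory strategy would instead give a chain over pairs (state, memory),
# which is exactly how a duplicated state like s1' arises on the slide.
chain = {s: mdp[(s, a)] for s, a in strategy.items()}
print(chain["s0"])  # [('s1', 0.5, 2), ('s2', 0.5, 0)]
```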
In a Markov chain we have bottom strongly connected components (BSCCs), that is, strongly connected components from which there is no way to escape, and a path in a Markov chain reaches a BSCC almost surely.

Now I define what I mean by the optimal expected value in an MDP. Once I choose a strategy, I have a Markov chain, and hence an expected value corresponding to that strategy. The optimal expected value, for a given function, is the supremum over all strategies of this expected value, and a corresponding strategy I call an expectation-optimal strategy (formally, see the block below). An MDP can also be viewed as a weighted two-player game where we simply forget the probabilities: player 1 chooses an action, and player 2 chooses, given that action, the next state. So again it is similar to the two-player games we saw this morning, just in a slightly different representation.

With these definitions, I can introduce the problems we study. We have looked at these different objectives, the fixed window mean-payoff, the bounded window mean-payoff, and their direct variants, and what we want is to compute the optimal expected value of these window functions for a given MDP. These problems were studied earlier in the context of two-player games by Chatterjee et al.
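Written out, with f one of the window functions above and σ ranging over strategies (again my reconstruction of the notation, not a formula from the slides):

```latex
% Optimal expected value of a payoff function f from state s:
\mathsf{Val}^{f}(s) \;=\; \sup_{\sigma} \; \mathbb{E}^{\sigma}_{s}\!\left[ f \right]
% A strategy attaining (or approaching) this supremum is expectation optimal.
```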
We start with the fixed window mean-payoff function: given an MDP and a window length l, we want to compute the optimal expected value of the fixed window mean-payoff for this MDP. We assume that l is bounded polynomially in the size of the MDP; the objective is meant to capture stability within a reasonable amount of time, so we do not want the window length to be too large. And recall that this is a prefix-independent objective.

Here is an example; let the window length be 3. If I start from s0 and play the blue action, then infinitely often I can see this sequence of edges from s0, and hence my fixed window mean-payoff, which is a prefix-independent function, gives me the value 0, because infinitely often I see length-l sequences whose average is 0. From s2 the value is likewise 0 if blue is played in s0: you see exactly the same sequence, which gives a window mean-payoff of 0. Whereas if the brown action is played in s0, then I choose this one, I see this, and starting from s2 I see a value of 2/3. Similarly, for state s1 I have a value like (3 + 6)/2; from s4, with a window of length 3, I can see the sequence 0, 1, 5, which gives me a value of 2; from s5 I also get a value of 2; and from s6, note that I will choose this edge rather than that one. So what we note here is that the strategy played from the states of the MDP is not necessarily memoryless: when I am coming from s5 I take this action, so that from s4 I get the value 2, whereas from here I take this edge, which gives me (−2 + 3 + 6)/3. The point to note is that for this kind of objective we essentially need memory.

Since this is a prefix-independent objective, and every path almost surely, that is, with probability 1, ends up in a maximal end component, we first restrict our attention to the MECs of the MDP. So let us consider a single MEC and see how the behaviour is within it. Within the MEC, once I choose a strategy I obtain a Markov chain, and a path in this Markov chain almost surely visits every finite sequence of states of length l of its BSCC infinitely often: you reach a bottom strongly connected component, and there, almost surely, every length-l path is visited infinitely often, and in particular the worst such sequence of weights is visited infinitely often. So in some sense, once a strategy is fixed and we have the Markov chain, we get to see the worst sequence of values infinitely often. This has the flavour of a two-player game: the adversary chooses this worst sequence of weights.

So the idea is that, for each state of the MEC, we solve the two-player fixed window mean-payoff game, and then we pick the state with the best value; within the MEC there is a strategy to reach that state almost surely, since, as I said earlier, within an end component you can reach any other state almost surely. We try to reach the best state and essentially play from there. For example, here, playing the brown action gives a value of 1, but here we saw a value of 2; so we solve the two-player game at each state of the maximal end component, play so as to almost surely reach here, and then follow the strategy from here.

That is the idea inside an end component; the remaining part is how to reach, in some sense, the best end components. Suppose I have an arbitrary MDP with several end components. I compute the value of each end component; say the best value in this one is λ_{M1}, and in that one λ_{M2}. Then I collapse each MEC M into a single vertex carrying a self-loop with that weight, and I get a new MDP. The problem then boils down to computing the optimal expected classical mean-payoff in this reduced MDP (a sketch of the whole procedure follows below).
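Putting the pieces together, here is a high-level Python-style sketch of the procedure for the expected fixed window mean-payoff. It treats three solvers as black boxes with hypothetical names and signatures (maximal_end_components, solve_fwmp_game for the two-player window game of Chatterjee et al., expected_mean_payoff for classical expected mean-payoff on MDPs), so it shows the structure of the reduction rather than a runnable implementation.

```python
def expected_fwmp_value(mdp, l, s_init):
    """Sketch: optimal expected FWMP_l value of mdp from s_init.

    Assumed black boxes (hypothetical signatures):
      maximal_end_components(mdp)        -> list of MECs
      solve_fwmp_game(mec, l)            -> dict: state -> two-player FWMP_l value
      collapse(mdp, mec, value)          -> mdp with mec replaced by one vertex
                                            carrying a self-loop of weight value
      expected_mean_payoff(mdp, s_init)  -> optimal expected classical MP value
    """
    for mec in maximal_end_components(mdp):
        # Inside a MEC, the worst length-l weight sequence of the induced
        # chain's BSCC is seen almost surely infinitely often, so each state
        # is worth its two-player game value against that adversary.
        game_values = solve_fwmp_game(mec, l)
        # Any state of a MEC is reachable almost surely from any other, so
        # the whole MEC is worth its best state's value.
        mdp = collapse(mdp, mec, max(game_values.values()))
    # Reaching the best end components is now a classical expected
    # mean-payoff problem on the reduced MDP.
    return expected_mean_payoff(mdp, s_init)
```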
So this is the idea. We also have a relative hardness result, which says that the problem is at least as hard as solving the two-player fixed window mean-payoff game; this is stated a bit more formally on the slide.

Then we study the direct fixed window mean-payoff objective, and we show that this problem is much harder. For the direct fixed window objective we cannot restrict our attention to the maximal end components, because we are looking from the beginning of the path, and the problem is in fact PSPACE-hard: deciding whether the expected direct fixed window value is above a given threshold is PSPACE-hard, by a reduction from the threshold probability problem for the stochastic shortest path. I will skip the reduction here. For membership, what we do is construct a new MDP that, along each path, keeps track of the minimal mean-payoff encountered over the windows of length l, that is, of the worst window seen so far (one standard way to realize such a monitor is sketched below, just before the questions). This new MDP is exponential in the window length l, and the problem can be solved in time exponential in l, with an exponential-memory strategy; we talked about finite-memory strategies, and here exponential memory is required.

Here is a summary of our results. I did not talk about the bounded window mean-payoff: there, instead of solving the two-player window mean-payoff game, we have to solve the two-player classical mean-payoff game, which gives the UP ∩ coUP complexity, and we again have a relative hardness result, namely that the expected bounded window mean-payoff problem is no easier than the classical two-player mean-payoff problem. We can also show that for the direct bounded window mean-payoff, over a sequence of weights, the value of the direct bounded window mean-payoff is exactly the same as the value of the bounded window mean-payoff. We also solved the problems for the special case of Markov chains: there the bounded window mean-payoff is polynomial, and the direct fixed window mean-payoff is no longer exponential but pseudo-polynomial in the weights appearing in the Markov chain.

One limitation of this definition is that once we enter the end components, the probabilities are somehow no longer important: we are playing a two-player game, and the value we get is independent of the probabilities. So there can be other notions of window mean-payoff in which the probabilities inside the maximal end components also play a significant role; we have some definitions and some initial results, and this is what we have been working on now. So I think I will stop here. Thank you.
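As promised above, here is a minimal Python sketch of a window monitor of the kind used for the direct fixed window objective. It relies on the inductive window property from the window-game literature (when a window closes, every window opened inside it has closed too, so one open window suffices); whether this matches the authors' exact construction is my assumption, and it assumes integer weights.

```python
from fractions import Fraction

def monitor_step(k, total, weight, l, lam):
    """Advance the window monitor by one edge of integer weight `weight`.

    (k, total) describes the single open window: k steps taken so far and
    `total` accumulated weight; (0, 0) means no window is open. Returns the
    new (k, total), or 'BAD' if a window failed to close within l steps,
    i.e. the direct fixed window objective for threshold lam is violated.
    """
    k, total = k + 1, total + weight
    if Fraction(total, k) >= lam:   # window closes: threshold met within l steps
        return (0, 0)
    if k == l:                      # ran out of steps: objective violated
        return "BAD"
    return (k, total)               # window still open

# Tiny usage: weights 3, 3, 5 with l = 3 and lam = 3 never open a bad window.
state = (0, 0)
for wt in [3, 3, 5]:
    state = monitor_step(*state, wt, 3, Fraction(3))
    print(state)  # (0, 0) each time: every window closes immediately
```

The product MDP then runs over pairs (MDP state, monitor state); counting the reachable (k, total) pairs gives the blow-up in l and the weights discussed in the talk.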
One thing you might ask is: what if you only want a 1 − δ fraction of your windows to satisfy this threshold? So you are asking about a 1 − δ fraction of the windows satisfying it. What we do here for the fixed window mean-payoff is the following: first of all, it is a prefix-independent objective, so we can restrict our attention to the end components. Some of the end components are good and some are not, and whether one is good depends on what we observe by playing a two-player game at each vertex of the maximal end component; and if some vertex turns out to be good, then you can reach that particular vertex from any other vertex almost surely by a deterministic memoryless strategy. I feel something similar will happen in your setting, essentially we would see that, but it is not immediate, because even if a kind of sequence has a really low but non-zero probability, you are going to see that kind of sequence infinitely often, since you are in a strongly connected component.

Is it true that if you see it infinitely often, then in fact you see it with positive frequency? Yes, with probability 1; that is what we use to solve the MDP case.

Could you push this further: instead of considering MDPs, can you consider the window mean-payoff for stochastic games? That is a perfectly valid problem, I think, but we have not really started working on it. Related to that, as I was going to say, a line of future work is coming up with further definitions of window mean-payoff, some of which are in the appendix slides, definitions for which we can no longer play a two-player game inside the maximal end components.

Are we talking about multiple objectives here? Multiple objectives, or you could say that with some probability I want a window mean-payoff and I also want something else in the worst case; yes, I think such objectives in conjunction can be studied as well, but that is not what we have done here. Thank you.