Okay, so compared to the framework of the last talk, I'll be talking about partially observable Markov decision processes. The framework is simpler in two respects, because I only have one player and the Markov decision processes we consider are discrete-time, but it will be more complex in one component, which is partial observability. This is joint work with Blaise Genest.

So after a short introduction to the model of partially observable Markov decision processes and the problems we will be considering on them, I'll study two different problems, optimizing the worst-case cost and optimizing the average cost, and then conclude.

Let me explain the model we're considering, and first start with Markov decision processes. This is a model where, as I said, you only have one player, who is responsible for making the non-deterministic choices. For example, here in state one only action A is available, so the player can choose action A. Action B is not specified, but you could think of it as going to some losing state, some sink state. After that, nature, or a random player, chooses what the successor will be, according to the probabilities: with probability one third the successor will be one, and with the same probability the successor can be state two or three. A strategy for the controller, or for the scheduler, is simply based on the actions and the states that are visited: given an alternating sequence of states and actions, the controller chooses a distribution over the enabled actions.

In this talk we will be interested in reachability objectives, and more precisely we want to look at strategies that ensure reaching a goal almost surely, so with probability one. In this very particular example, there is a very simple strategy to reach the goal states here almost surely. It is very simple because it is memoryless, so it does not depend on the whole history but only on the current state, and it is pure, so the decision is deterministic: you just say "I'll take this action" and you don't need randomization. Here, obviously, if in state one you choose action A, then when in state two you choose B, and when in state three you choose C, you will reach the goal almost surely in this example.

Things become a bit different if you consider partial observation. Here we will have a very simple modeling of partial observation using a partition: we partition the state space into several classes, represented by colors here, and the objective of the player is still to reach the goal almost surely. But now a strategy for the controller cannot be based on the current state of the system, only on the observation it receives from the system, and the observation is exactly the class of the partition the system is in. So based on the sequence of observations and the actions it played, it chooses the next action to be played, or more precisely a distribution over the possible actions. It is quite easy to see that on this particular example there is no strategy that can ensure reaching the goal almost surely. Basically, the problem is that, being in two or three, the decision you need to make in order to reach the goal is different, and if you do not make the correct decision, you will end up in state four, which is losing, because you can never reach the goal again from state four.
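Going back to the fully observable example for a second, here is a minimal sketch in Python, with my own hypothetical encoding of the slide's MDP (the exact probabilities and the losing state four are assumptions). It checks the standard fact used here: under a memoryless pure strategy, the goal is reached almost surely if and only if every state reachable in the induced Markov chain can itself still reach the goal.

```python
# Hypothetical encoding of the talk's first example: an MDP as
# state -> action -> list of (probability, successor), and a check that a
# memoryless pure strategy reaches the goal almost surely.
from collections import deque

mdp = {
    1: {"a": [(1/3, 1), (1/3, 2), (1/3, 3)]},
    2: {"b": [(1.0, "goal")], "c": [(1.0, 4)]},
    3: {"c": [(1.0, "goal")], "b": [(1.0, 4)]},
    4: {"a": [(1.0, 4)]},                # assumed losing sink state
    "goal": {"a": [(1.0, "goal")]},
}

def reachable(start, succ):
    seen, todo = {start}, deque([start])
    while todo:
        for t in succ(todo.popleft()):
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return seen

def almost_surely_wins(strategy, init, goal="goal"):
    # successors in the Markov chain induced by the memoryless pure strategy
    succ = lambda s: [t for _, t in mdp[s][strategy.get(s, next(iter(mdp[s])))]]
    # probability-1 reachability in a finite chain: every reachable state
    # must still be able to reach the goal
    return all(goal in reachable(s, succ) for s in reachable(init, succ))

print(almost_surely_wins({1: "a", 2: "b", 3: "c"}, init=1))  # True
```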
Okay, so this framework has been well studied, even for more elaborate questions where you want to compute the optimal probability to reach a goal. But we will consider a slightly different framework, where you start from a partially observable Markov decision process and you add an additional action which was not present before, which is a request for full observation. On top of the actions A, B, C that we had in the previous model, we have this special request action that you can perform everywhere, and which discloses the precise state of the system. So now the new observations are all the classes of the partition and also the individual states, and a strategy for the controller is now based on this new set of observations and the new set of actions, which I denote with primed letters. Obviously, if there is an almost-surely winning strategy in the fully observable MDP, then there will be one in this model. So here, on this example, there is an almost-surely winning strategy, but what we are interested in is: is there a cheap almost-surely winning strategy? I will make precise in a moment what cheap means.

There are two natural ways of specifying cheap. First of all, if you take a single path, it is natural to say that the cost of the path is the number of requests, of these special actions, that you make along the path. Now, if you take a strategy for the controller, there are at least two natural possibilities. The first one is to say that the cost of the strategy is the worst-case cost that you can encounter along the paths that follow this strategy. The second option is to consider the average cost of this strategy, so you look at the expected number of requests when using it. So I now come to the problem statement we will be looking at: we try to find almost-surely winning strategies that either optimize the worst-case cost or optimize the average cost.

I'll first start with the worst-case cost. Basically, for the worst-case cost we can base the analysis on belief states, and even on discrete belief states. So what are they? In general, belief states are distributions over the possible states the system can be in, which record the probability of being in each of the states at a given moment. Here we will consider, not the distribution, but only the support of the distribution, so the set of states we can be in. If we look at this example, where you have three different classes in the partition, say we start from state one. First of all, we know that we start from state one. Now, if we perform an A and receive observation blue, hopefully you can read this, which corresponds to this class, then we know that starting from one and performing an A could lead to either state one, two, three or four; but if you receive the information that you are in a blue state, then you can refine your information and know that you are in either state one or two. You even know more than that: you know that you are in one or two with equal probability, because you have exactly the same probability to go to one or to two. But let's keep just the set, and not the precise distribution.
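As a minimal sketch of this support tracking, here is the update in Python; the names delta and obs are mine, not from the talk, and the talk writes this update up(S, a, o) on the next slide.

```python
# delta(s, a): set of successors of s under action a with positive
# probability; obs(t): the observation emitted in t (its partition class,
# or t itself right after a request).
def up(S, a, o, delta, obs):
    # keep every possible successor of a state in S that is compatible with o
    return {t for s in S for t in delta(s, a) if obs(t) == o}

# With suitable delta and obs for the running example, one would get, e.g.,
# up({1}, 'a', 'blue', delta, obs) == {1, 2} and then
# up({1, 2}, 'a', 'yellow', delta, obs) == {3, 4} (my reading of the slides).
```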
Now, if you perform an A and receive the observation yellow, then you know that the next belief is either three or four; it cannot be five, because there is no A action leading to five from either state one or two. And last, if you play a B and receive observation yellow, then you know that you will be in state five, and you also know that in the previous state you were in state three; but okay, let's just track the information that you are in state five. Of course there are many paths in this game, and here is just another example, where I used the request action, just to make it clear. So again, starting in state one, we play an A and receive observation yellow, so we know that we are in either state three or four. Then we make a request, and we can receive two pieces of information depending on where we stand, either three or four. Let's say we receive the information four: then we know that the current state is four, and we can update the belief to this singleton set. Then, playing a B and receiving blue, we know that we are in state one.

Using this, we can build this whole belief graph, where, okay, probably you can't read this, but it will be much bigger on the next slide, the red arcs represent the request actions. So here you see that from the belief given by the class of state one, using a request action, you can end in either state one or state two. In the following I will use this notation: up(S, a, o) represents the belief update when you start from some set S, you play action a, and you receive observation o.

So here is the graph we obtain. In a first step, what we try to do is to ensure reaching the goal almost surely, and this is the rather easy part; then we will try to optimize the strategy in order to minimize the worst-case cost. So what shall we do? Well, first, there are some states from which you cannot win: even with full information, you won't be able to win. On this example this was the case of state five, where even if you use all the red edges you want, you will never be able to reach the goal. These states will be discarded; I mean, you can't win from them. Now, for the other states, the beliefs that are not losing, we split the state space into three disjoint sets. The first one, W, is the set of beliefs where you know that you have already won; on this particular example, this is the case of this single location. Here the strategy has nothing to do, because you have already won. We also define a set of winning states, W_rec, from which you need to perform a request immediately, otherwise you have a chance to lose. How can this be expressed? It's rather simple: it is the set of beliefs S such that, whatever action you take, there is a possible observation such that the updated belief is losing. If we look at this example and start from the belief {3, 4}: if we play A or B, there is an edge here leading to five and also an edge leading to one, but in both cases, playing A or B, if we receive observation yellow, then we go to the losing state. If you remember the game we started with, this is rather natural: it just says that it is not safe to play A or B when you are in belief state {3, 4}. The last part of this partition is W_safe, and it is just the remaining part.

Let's now see what a strategy can do according to those sets. We don't take only one strategy, but a family of strategies, which basically differ in how often they perform requests. We've seen that in W_rec it is compulsory to perform a request immediately.
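Here is a small sketch, in my own notation, of the two definitions just given: a safe action never risks an update into the losing beliefs, and W_rec collects the non-losing beliefs with no safe action, where a request is compulsory. Beliefs are assumed to be frozensets, and actions, obs_of, up and losing are assumed helpers over the belief graph, not the speaker's code.

```python
def safe_actions(S, actions, obs_of, up, losing):
    # an action is safe if no observation can drive the belief into `losing`
    return [a for a in actions(S)
            if all(up(S, a, o) not in losing for o in obs_of(S, a))]

def w_rec(beliefs, actions, obs_of, up, losing):
    # non-losing beliefs where every ordinary action risks losing:
    # there, performing a request immediately is compulsory
    return {S for S in beliefs
            if S not in losing
            and not safe_actions(S, actions, obs_of, up, losing)}
```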
So in this first set, W_rec, the strategy chooses to play a request; otherwise there is a chance to lose, and you will never achieve the goal almost surely. Now, in the safe states, what we do is play a request with a small probability, probability one over N, and otherwise play uniformly over all safe actions. So what is a safe action? It is exactly the complement of the condition for the request set: an action such that, for all observations, the update is not losing; it's written here, very small. It is quite easy to observe that all these strategies are almost surely winning, provided of course that you start not from the losing region but from the winning region.

However, okay, sorry, maybe I can come back to the example here. If you remember what we do: in the area which is neither green, orange nor red, we perform a request with a small probability, but still we perform a request with probability one over N. Here, on this example, you can see that this is not optimal, because from state two, if you play a B, then you win directly; there is no point in playing a request when you are in this state. So what we explain here is how to optimize the strategy in order to make as few requests as possible. What we compute is an increasing sequence of sets such that in S_k you need at most k requests to win, and if a state is in S_k but not in S_{k-1}, it means that for every almost-surely winning strategy there is a path that needs k requests to win.

Okay, so if this rectangle represents the belief state space: first of all, we already have the losing states, from which we know we can't win, and we also have these states where we know that we have already won. How do we compute S_0 first? In the example I had before, belief {2} should be in S_0, because you know that you don't need any request to win from there. So S_0 can simply be computed as the set of states from which there is a strategy reaching the goal with probability one without using any request. We start from the belief Markov decision process, we remove the red edges, and we ask whether we can reach the green area almost surely; "without any request" is because I removed the red edges. This can be computed in several steps; I mean, it's a standard question for Markov decision processes, it doesn't have anything to do with our framework. And so the slightly optimized strategy, starting from the canonical ones we had, is to say: from belief states in S_0 I don't perform this request, I only play uniformly over the safe actions.

Now, how do I compute S_1? I first define a set L_1, which represents the belief states where all the individual states in the set are in S_0. Remember that I had this special request action that allows me to ask what the precise state is. So if I am in L_1 and I make a request, I will end in one of the singleton states that compose my set, and if I know that I can win from each of these singleton sets, then I'm fine. Of course L_1 is a subset of S_1, and S_0 is also a subset of S_1, because we are building increasing sets. But for S_1 we can take a bit more: we can take all the states that are able to reach, with probability 1 and without using requests, the set L_1 or the set S_0, just as before.
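Here is a minimal sketch of this computation; it already implements the general iteration described on the next slide (replace 1 by i+1), with S_1 being the first step. The helper almost_sure_reach_no_request is an assumption: on the belief MDP with the red (request) edges removed, it returns the beliefs from which a target set is reached with probability 1, the standard MDP computation just mentioned, and it is assumed to contain its target, so the sets grow.

```python
def compute_S_sets(beliefs, goal_beliefs, almost_sure_reach_no_request):
    S = [almost_sure_reach_no_request(goal_beliefs)]   # S_0: zero requests
    while True:
        # L_{k+1}: beliefs from which a request surely lands in a singleton
        # belief that already belongs to S_k
        L = {B for B in beliefs
             if all(frozenset({s}) in S[-1] for s in B)}
        nxt = almost_sure_reach_no_request(L | S[-1])  # S_{k+1}
        if nxt == S[-1]:
            # fixpoint reached; the winning beliefs left over form
            # S_infinity, where some paths need unboundedly many requests
            return S
        S.append(nxt)
```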
Now, to optimize our strategy a bit more: in L_1 but not in S_0, because in S_0 we don't need to perform a request, we perform this request, and otherwise we just ensure to stay in S_1, for example with a uniform distribution over the actions that are safe with respect to this goal. Basically, at some point we will reach L_1 or S_0, and then we either perform a request or win directly without any request.

Okay, so this was the first iteration, but basically you can replace 1 here by i+1 and 0 by i, and perform this iteration again. At some point, because the state space is finite, S_n is equal to S_{n-1}. Still, this does not mean that you have covered all the winning states. We denote by S_infinity the remaining part, and we can prove that in S_infinity, whatever optimal strategy, sorry, whatever almost-surely winning strategy you take, there are some paths where you need infinitely many requests. So in some sense this partition is optimal, because you can prove, I'll try to say it again, that starting in some state in S_i, you might need i requests to reach the goal. This way we can compute the minimum worst-case cost, because depending on where your initial state belongs you get the optimal worst case, together with a strategy that ensures this optimum. And it is in EXPTIME: basically it is polynomial time in the size of the MDP we consider, and the Markov decision process we consider is the belief one, so it is exponential in the size of the starting decision process.

Okay, so on some particular examples, of course, optimizing the worst case coincides with optimizing the average cost, but this is far from being the general case. So let's now have a look at what we can do for the average cost. Here I will use game terminology and speak of the value of a game, or of a POMDP, as the infimum of the average cost over all almost-surely winning strategies. And the first bad news is that the value cannot be computed, which can be formulated like this: whatever constant you take, you cannot tell whether the value is smaller than this constant or not. It is not that surprising, because optimizing cost functions for partially observable Markov decision processes is already undecidable; but still, we have a very special case of POMDPs, so we could have hoped that it is decidable.

I'll try to explain the proof briefly. We start from this problem, which is known to be undecidable: you take a probabilistic finite automaton which is very particular, in that either it accepts all words with a very small probability, with probability smaller than epsilon, or there is at least one word which is accepted with a very high probability, and you cannot tell which case holds. So we start with this probabilistic finite automaton and this variant of the emptiness problem. P here is the probabilistic finite automaton, and we turn it into a game as we have with POMDPs: we add some states here, goal, sink, and those two test states. Basically, what happens is that in the final states, if you perform a sharp, some new action, then you reach the goal; but if you do it from another state, then with probability one half you end in T_A and with one half in T_B. The same phenomenon happens as with states three and four before: from T_A you need to play an A in order to be able to try again, whereas in T_B you need to play a B, and if you make the reverse choice, then you end up in a sink state from which you will no longer be able to reach the goal.
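Before stating the correspondence, here are the quantities involved, written in my own notation (not the slides'):

```latex
% The average cost of an almost-surely winning strategy \sigma,
% and the value of the POMDP \mathcal{M}:
\[
  \mathrm{cost}_{\mathrm{avg}}(\sigma) = \mathbb{E}^{\sigma}[\#\,\text{requests}],
  \qquad
  \mathrm{val}(\mathcal{M}) = \inf_{\sigma\ \text{a.s.\ winning}} \mathrm{cost}_{\mathrm{avg}}(\sigma).
\]
% Undecidability: for every constant c, one cannot decide whether
% val(M) <= c, by reduction from the PFA problem of distinguishing
% "every word is accepted with probability < epsilon" from
% "some word is accepted with probability > 1 - epsilon".
```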
So what we can show is that the probabilistic finite automaton accepts a word with high probability if and only if the value is small. I will not detail how it works, but it is rather straightforward. What we can even prove is that the best approximation, so the value that is computed compared to the real value of the game, has this shape, and the approximation factor tends to infinity when epsilon tends to zero. So we can refine the statement we had before and say that, whatever approximation factor you take, you cannot approximate the value within this factor; but to do so you need bigger and bigger objects: in order to increase the approximation factor within which you cannot approximate, you need bigger and bigger objects.

We can do more than that: in fact, we can show that finding good approximations is NP-hard, where a good approximation means that the state space of the game is rather small, here quadratic in some parameter n, and also the reachable beliefs are few, again quadratic in this parameter n, yet the approximation factor is close to exponential in n. I insist here that what is difficult is to show that, even if a POMDP has few reachable belief states, the approximation error can be large. Of course, this statement holds assuming P is not equal to NP.

So let me try to explain a bit how it works. We reduce from 3-SAT: we start from a 3-SAT formula which has a given number of clauses, m, and a given number of variables, k, and our parameter n will be the product of those two values, so n = mk. Recall that phi is satisfiable if for every clause you can choose a literal which will be set to true, and there is no conflict between the literals you have chosen to set to true: if you take two different clauses, and take l_i here and l_j there, then there is no conflict between those choices. So what will the partially observable game look like? First of all, there will be a random choice of the variable you will be monitoring, and then you will be punished each time there is a conflict on this variable; being punished means being forced to make a request, so increasing your cost. The reduction ensures that if the formula is satisfiable, then the value is smaller than n, and if it is not satisfiable, then the value is very big, greater than some exponential in n.

Just a very brief hint on the reduction. We start in some initial state, and as I said, there is a random choice of the variable you will be monitoring. So here, I don't know if you can read this, this is c_{1,1}, c_{1,i}, up to c_{1,k}; remember we have k variables. Basically, this c_{1,i} means: I will monitor variable x_i, and I will check that you don't at some point say that x_i holds and then that x_i doesn't hold. Now, the nice thing would be to say: if at some point you set x_i to true, which is represented here, I will store this information in a state where I remember that x_i was true; the same if you said that x_i was not true, then I will store this information; and if you say something about a variable which is not of interest to me, about x_j, then I just don't assign a value to x_i. We continue like this, and if at some point there is a conflict, so we are in a state where we remember that x_i is true and you tell me x_i bar, so that x_i is not true, then we go to this test gadget, where you need to perform a request in order to be able to try the whole process again. Oh, sorry, I did not mention: if no conflicts are detected, then you reach the goal state here, and you are fine.
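In summary, the gap the reduction achieves can be written as follows; this is my paraphrase, and the precise exponential bound is on the slides, not stated in the talk:

```latex
% With n = mk for a 3-SAT formula \varphi having m clauses and k variables:
\[
  \varphi \text{ satisfiable} \;\Rightarrow\; \mathrm{val} \le n,
  \qquad
  \varphi \text{ unsatisfiable} \;\Rightarrow\; \mathrm{val} \ge c^{\,n}
  \ \text{for some constant } c > 1,
\]
% so approximating the value within a factor close to exponential in n
% would decide 3-SAT; hence good approximation is NP-hard.
```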
So this is all very fine, but the problem with this reduction is that you have too many reachable belief states. In order to fix this, we blur the information: what we say is that if I set x_i to true, instead of going with probability 1 to this state where I store that x_i is true, I will go there with probability 1 - epsilon, and to the other two states with probability epsilon over 2 each. In this way, the controller is not able to know which state here is reached, so the belief states will be the same whichever choice was made. Okay, so this was a very brief idea of how it works.

Let me jump to the conclusion. What we have looked at is minimizing the number of requests for full information in POMDPs, where the objective is to reach a goal almost surely. The worst-case cost you can compute, okay, sorry, this should say EXPTIME, you can compute it in EXPTIME, so it is polynomial time in the number of beliefs but EXPTIME in the number of states, excuse me for that, together with an optimal strategy. For the average cost, it is undecidable: you cannot compute the value, and you cannot even approximate it; and we somehow characterized the size of the least approximation factor in terms of the size of the model we start with.

There are a few future directions that could be investigated. The first one: here we have a very simple framework where you request full information; it could be interesting to have different levels of information, with different costs for requesting these different levels, and see what happens there. Another question would be not to ask for reaching the goal almost surely, but maybe with a slightly lower probability, but then allowing for, sorry, fewer requests; so to find a tradeoff between the objective, the reachability probability, and the cost. Thank you for your attention.