Okay, so compared to the framework of the last talk, I'll be talking about partially observable Markov decision processes. The framework is simpler in two respects, because I only have one player and the Markov decision processes we consider are discrete-time, but it will be more complex in one component, which is partial observability. This is joint work with Blaise Genest. So I need to define all these words.

After a short introduction to the model of partially observable Markov decision processes and the problem we will be considering on them, I'll study two different problems, optimizing the worst-case cost and optimizing the average cost, and then conclude.

So let's explain the model we're considering, starting with Markov decision processes. This is a model where, as I said, you only have one player, who is responsible for making the non-deterministic choices. For example, here in state 1 only action a is available, so the player can choose action a. Action b is not specified, but you could think of it as going to some losing state, some sink state. After that, nature, or a random player, chooses what the successor will be, according to the probabilities: with probability one third the successor will be 1, and with the same probability the successor can be state 2 or state 3. A strategy for the controller, or the scheduler, is simply based on the actions and the states that are visited: given an alternating sequence of states and actions, the controller chooses a distribution over the enabled actions.

In this talk we will be interested in reachability objectives, and more precisely in strategies that ensure reaching a goal almost surely, that is, with probability 1. In this very particular example there is a very simple strategy to reach the goal states almost surely. It is very simple because it is memoryless, so it does not depend on the whole history but only on the current state, and it is pure, so the decision is deterministic: you just say "I'll take this action" and you don't need randomization. Here, obviously, if in state 1 you choose action a, then in state 2 you choose b, and in state 3 you choose c, you will reach the goal almost surely in this example.

Things become a bit different if you consider partial observation, and here we will use a very simple modeling of partial observation via a partition: we partition the state space into several classes, represented by colors here. The objective of the player is still to reach the goal almost surely, but now a strategy for the controller cannot be based on the current state of the system, only on the observation it receives from the system, and the observation is exactly the part of the partition the system is in. So based on a sequence of observations and actions played, the controller chooses the next action to play, or more precisely a distribution over the possible actions.

Okay, so it is quite easy to see that on this particular example there is no strategy that can ensure reaching the goal almost surely. Basically, the problem is that when the system is in state 2 or 3, the decision you need to make in order to reach the goal differs between the two states, and if you make the wrong decision you end up in state 4, which is losing because you can never reach the goal again from there.
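To make the model concrete, here is a minimal Python sketch of the example just described; the exact transition table and the color names are my reconstruction of the slide, not taken verbatim from the talk.

```python
import random

# Transition function: delta[state][action] = list of (successor, probability).
# States 2 and 3 require different actions (b vs. c) to reach the goal, and
# the wrong choice leads to the absorbing losing state 4.
delta = {
    1: {"a": [(1, 1/3), (2, 1/3), (3, 1/3)]},
    2: {"b": [("goal", 1.0)], "c": [(4, 1.0)]},
    3: {"b": [(4, 1.0)], "c": [("goal", 1.0)]},
    4: {"a": [(4, 1.0)]},
    "goal": {"a": [("goal", 1.0)]},
}

# Partition of the state space into observation classes ("colors").
# Crucially, states 2 and 3 share an observation, which is exactly why no
# observation-based strategy wins almost surely in this example.
observation = {1: "blue", 2: "yellow", 3: "yellow", 4: "red", "goal": "green"}

def step(state, action):
    """Nature resolves the probabilistic choice after the player's action."""
    successors, probs = zip(*delta[state][action])
    return random.choices(successors, weights=probs)[0]

# An observation-based strategy maps the history of observations and actions
# to a distribution over actions; a memoryless pure strategy, as in the fully
# observable example, is just a map from states to actions.
```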
Okay, so this framework has been well studied, even for more elaborate questions where you want to compute the optimal probability to reach a goal. But we will consider a slightly different framework: you start from a partially observable Markov decision process and you add an additional action, which was not present before, and which is a request for full observation. So on top of the actions a, b, c that we had in the previous model, we have this special request action that you can perform everywhere, and which discloses the precise state of the system. The new observations are now all the parts of the partition and also the individual states, and the strategy for the controller is based on this new set of observations and on the new set of actions, which I denote with primed letters. Obviously, if there is an almost-sure winning strategy in the fully observable MDP, then there is a strategy in this model. So on this example there is an almost-sure winning strategy, but what we are interested in is: is there a cheap almost-sure winning strategy? I will make precise in a moment what "cheap" means.

There are two natural ways of specifying "cheap". First of all, if you take a single path, it is natural to say that the cost of the path is the number of requests, that is, of these special actions, that you make along the path. Now if you take a strategy for the controller, there are at least two natural possibilities. The first one is to say that the cost of the strategy is the worst-case cost that you can encounter along the paths that follow this strategy. The second option is to consider the average cost of the strategy, so you look at the expected number of requests under this strategy.

So now I come to the problem statement: we will try to find almost-surely winning strategies that either optimize the worst-case cost or optimize the average cost.
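As a hedged formalization of these two cost notions (the notation ρ, σ, and cost is mine, not from the slides):

```latex
\mathrm{cost}(\rho) \;=\; \#\{\, i \mid \text{the } i\text{-th action of } \rho \text{ is a request} \,\},
\qquad
\mathrm{cost}_{\mathrm{wc}}(\sigma) \;=\; \sup_{\rho \,\models\, \sigma} \mathrm{cost}(\rho),
\qquad
\mathrm{cost}_{\mathrm{avg}}(\sigma) \;=\; \mathbb{E}_{\sigma}\!\bigl[\mathrm{cost}(\rho)\bigr],
```

where ρ ranges over the paths consistent with the strategy σ.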
Let's first start with the worst-case cost. Basically, for the worst-case cost we can base the analysis on belief states, and even on discrete belief states. So what are they? In general, belief states are distributions over the possible states the system can be in, recording the probability of being in each state at a given moment. Here we will consider not the distribution but only its support, so the set of states we can be in.

Let's look at this example, where you have three different parts in the partition, and say we start from state 1; so first of all, we know that we start in state 1. Now if we perform an a and receive observation blue — hopefully you can read this; it corresponds to this part — then we know that starting from 1 and performing an a could lead to states 1, 2, 3, or 4. But if you receive the information that you are in a blue state, you can refine your information and know that you are in either state 1 or state 2. You even know more than that: you know that you are in 1 or 2 with equal probability, because you have exactly the same probability of going to 1 or to 2; but let's keep just the set and not the precise distribution. Now if you perform an a and receive the observation yellow, then you know that the next belief is {3, 4}, and it cannot contain 5 because there is no a-action leading to 5 from either state 1 or state 2. Last, if you play a b and receive observation yellow, then you know that you will be in state 5, and you also know that in the previous state you were in state 3; but okay, let's just track the information that you are now in state 5.

Of course, there are many paths in this game, and here is just another example, where I used the request action just to make things clear. We again start in state 1, then we play an a and receive observation yellow, so we know that we are in state 3 or 4. Then we make a request, and we can receive two pieces of information depending on where we stand, either 3 or 4. Say we receive the information 4: then we know that the current state is 4, and we can update the belief to the singleton set {4}. Then, playing a b and receiving observation blue, we know that we are in state 1.

Using this, we can build the whole belief graph. Okay, you probably can't read this, but it will be much bigger on the next slide. The red arcs represent the request actions; so here you see that from the part {1, 2}, using a request action, you can end up in either state 1 or state 2. We will use the following notation: up(S, a, o) represents the belief update when you start from some set S, play action a, and receive observation o.

So here is the graph we obtain, and in the first step what we will try to do is ensure reaching the goal almost surely — this is the rather easy part — and then we will try to optimize the strategy in order to minimize the worst-case cost.
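The belief-support update up(S, a, o) can be sketched directly in code; this is a straightforward reading of the definition above, reusing the illustrative delta and observation tables from the earlier sketch:

```python
def up(S, a, o):
    """Belief-support update up(S, a, o): the states reachable from some
    state of the current belief S by action a that are consistent with
    receiving observation o."""
    return frozenset(
        s2
        for s in S
        for (s2, p) in delta[s].get(a, [])
        if p > 0 and observation[s2] == o
    )

# For instance, with the toy tables above, up({1}, "a", "yellow") is
# frozenset({2, 3}): the blue successor 1 is filtered out by the observation.
```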
Okay, so what shall we do? Well, first, there are some states from which you cannot win, even with full information. On our example this was the case of state 5: even if you use all the red edges, you will never be able to reach the goal from there. These states will be discarded; you simply can't win from them.

Now, for the other states, so the beliefs that are not losing, we split the state space into three disjoint sets. The first one, call it W_won, is the set of beliefs where you know that you have already won; on this particular example it is the case of this single location. Here the strategy has nothing to do, because you have already won. We also specify a set of winning states, W_req, from which you need to perform a request immediately, because otherwise you have a chance of losing. How can this be expressed? It's rather simple: it is the set of beliefs S such that whatever action you take, there is a possible observation such that the updated belief is losing. If we look at this example and start from the belief {3, 4}: if we play a or b — there is an edge leading to 5 and also an edge leading to 1 — but in both cases, whether playing a or b, if we receive observation yellow, then we go to the losing state. If you remember the game we started with, this is rather natural: it just says that it is not safe to play a or b when you are in the belief state {3, 4}. The last part of this partition is W_safe, which is just the remaining part.

Let's now see what the strategy can do according to these sets. We take not just one strategy but a family of strategies, which basically differ in how often they perform requests. We have seen that in W_req it is compulsory to perform a request immediately, so in this first set the strategy plays a request; otherwise there is a chance to lose, and you will never achieve the goal almost surely. Now, in the safe states, what we do is play a request with a small probability, here probability 1/n, and otherwise play uniformly over all safe actions. What is a safe action? A safe action is exactly the negation of the condition for the request set: it is an action such that for all observations, the updated belief is not losing. It is written here — the letters are small, but it is quite easy to observe that all these strategies are almost surely winning, provided, of course, that you start from the winning region and not from the losing one.
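A minimal sketch of this classification, assuming the up function from before and that the sets lose and won of losing and already-won beliefs have already been computed (all names here are mine):

```python
import random

def safe_actions(S, actions, observations, lose):
    """Actions a such that no observation can push the belief S into the
    losing region -- exactly the condition defining safe actions above."""
    return [a for a in actions
            if all(up(S, a, o) not in lose for o in observations)]

def canonical_move(S, actions, observations, lose, won, n):
    """One step of the canonical almost-surely winning strategy: request
    immediately in W_req, do nothing more in W_won, and in W_safe request
    with small probability 1/n, otherwise play a uniform safe action."""
    if S in won:
        return None                      # already won, nothing to do
    safe = safe_actions(S, actions, observations, lose)
    if not safe:                         # S is in W_req
        return "request"
    if random.random() < 1 / n:          # occasional request in W_safe
        return "request"
    return random.choice(safe)
```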
However — sorry, maybe I can come back to the example here — this is not optimal. Remember what we do in the area which is neither green, orange, nor red: we perform a request with a small probability, but we still perform a request, with probability 1/n. On this example you can see that this is not optimal, because from state 2, if you play a b, then you win directly, so there is no point in playing a request when you are in this state.

So what we explain here is how to optimize the strategy in order to make as few requests as possible. What we will compute is an increasing sequence of sets S_k such that from S_k you need at most k requests to win; and if a belief is in S_k but not in S_{k-1}, it means that for every almost-surely winning strategy there is a path that needs k requests to win.

Okay, so if this rectangle represents the state space — the belief state space — first of all we already have the losing states, from which we know we cannot win, and we also have the states where we know we have already won. How do we compute S_0? In the example I had before, the belief {2} should be in S_0, because you know that you do not need any request to win from there. Well, S_0 can simply be computed as the set of beliefs from which there is a strategy reaching the goal with probability 1 without using any request. So we take the belief Markov decision process, we remove the red edges, and we ask whether we can reach the green area almost surely without any request — without any request because I removed the red edges. This can be computed in several steps; I mean, it is a standard question for Markov decision processes and does not have anything to do with our particular framework. The slightly optimized strategy we add, starting from the canonical ones, says: from belief states in S_0, I do not perform a request; I only play uniformly over safe actions.

Now how do I compute S_1? I first define a set L_1, which consists of the belief states all of whose individual states are in S_0. Remember that I have this special request action that allows me to ask what the precise state is: if I am in L_1 and I make a request, I will end in one of the singleton states that compose my set, and if I know that I can win from each of these singleton sets, then I am fine. Of course, L_1 is a subset of S_1, and S_0 is also a subset of S_1, because we are building increasing sets. For S_1 we can actually take a bit more: we take all the beliefs that are able to reach, with probability 1 and without using requests, the set L_1 or the set S_0, just as before. Now, how do we optimize our strategy a bit more? Well, what we do is that in L_1 — and not in S_0, because in S_0 you know you don't need to perform a request — we perform the request, and otherwise we just ensure staying in S_1, for example with a uniform distribution over the actions that are safe with respect to this goal. Basically, at some point we will reach L_1 or S_0, and then we either perform a request or win directly without requests.

Okay, so this was the first iteration, but basically you can replace 1 by i+1 and 0 by i and perform this iteration again. And at some point, because the state space is finite, it will stabilize: at some point S_n equals S_{n-1}. Still, this does not mean that you have covered all the winning states. We denote by S_infinity the remaining part, and we can prove that in S_infinity, whatever almost-surely winning strategy you take, there are some paths that need infinitely many requests. So in some sense this partition is optimal, because you can prove — let me try to say it again — that starting in S_i, in some state of S_i, you might need i requests to reach the goal.
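Here is a compact sketch of this iteration, assuming a helper almost_sure_reach(targets) that answers the standard MDP question mentioned above: from which beliefs can the goal, or a belief in targets, be reached almost surely once the request edges are removed. All names are illustrative.

```python
def request_levels(beliefs, almost_sure_reach):
    """Compute the increasing sequence S_0, S_1, ... described in the talk.
    Winning beliefs outside the final fixpoint form S_infinity, where every
    almost-surely winning strategy has paths needing infinitely many
    requests."""
    S = almost_sure_reach(frozenset())     # S_0: win without any request
    levels = [S]
    while True:
        # L_{i+1}: beliefs whose individual states all lie in S_i, so a
        # single request from there lands in a winning singleton.
        L = {B for B in beliefs
             if all(frozenset([s]) in S for s in B)}
        S_next = almost_sure_reach(L | S)  # reach L_{i+1} or S_i, no request
        if S_next == S:                    # finite state space: stabilizes
            return levels
        S = S_next
        levels.append(S)
```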
So this way we can compute the minimal worst-case cost: depending on where your initial state belongs, you get the optimal worst-case cost, together with a strategy that ensures this optimum, and it is in EXPTIME. Basically, it is polynomial time in the size of the MDP we consider, but the Markov decision process we consider is the belief one, so it is exponential in the size of the starting decision process.

Okay, so on some particular examples, of course, optimizing the worst-case cost coincides with optimizing the average cost, but this is far from being the general case. So let's have a look at what we can do for the average cost. I will use game terminology and speak of the value of a game, or of a POMDP, as the infimum of the average cost over all almost-surely winning strategies. The first theorem states that the value cannot be computed, which can be formulated like this: whatever constant you take, you cannot tell whether the value is smaller than this constant or not. It is not that surprising, because optimizing cost functions for partially observable Markov decision processes is already undecidable; but still, we have a very special case of POMDPs, so we could have hoped that it would be decidable.

We start from a problem which is known to be undecidable: you take a probabilistic finite automaton which is of a very particular kind — either it accepts all words with a very small probability, with probability smaller than epsilon, or there is at least one word which is accepted with very high probability — and you cannot tell which case holds. So we start with this probabilistic finite automaton and this variant of the emptiness problem, and we turn it into a game as we have with POMDPs. Here P is the probabilistic finite automaton, and we add some states: a goal, a sink, and those two test states. Basically, what happens is that in the final states, if you perform this new special action, you reach the goal; but if you do it from another state, then with probability one half you end in T_A and one half in T_B. In T_A and T_B the same phenomenon happens as in states 3 and 4 before: from T_A you need to play an a in order to be able to restart and try again, whereas in T_B you need to play a b, and if you make the reverse choice, you end up in a sink state from which you will no longer be able to reach the goal. What we can show is that the probabilistic finite automaton accepts a word with high probability if and only if the value is small. I will not detail how it works, but it is rather intuitive.
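For reference, here is a hedged statement of the underlying undecidable promise problem (the formulation and notation are mine; the exact constants used in the talk may differ):

```latex
\text{Given a PFA } P \text{ and } \varepsilon > 0,\ \text{decide between:}
\qquad
(1)\ \forall w:\ \Pr\nolimits_P(w) \le \varepsilon
\qquad\text{vs.}\qquad
(2)\ \exists w:\ \Pr\nolimits_P(w) \ge 1 - \varepsilon .
```

Distinguishing (1) from (2) is undecidable, and the reduction above ensures that case (2) holds exactly when the value of the constructed game is small.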
Well, what we can even prove concerns the best approximation: the value that can be computed, compared to the real value of the game, has this shape, and the approximation factor tends to infinity when epsilon tends to zero. So we can refine the statement we had before and say that whatever approximation factor you take, you cannot approximate the value within this approximation factor. But to do so you need bigger and bigger objects: in order to increase the approximation factor within which you cannot approximate, you need bigger and bigger objects.

We can do more than that. In fact, what we can do is show that even good approximations are NP-hard, where good approximation means the following: the state space of the game is rather small, here quadratic in some parameter; the reachable beliefs are few, again quadratic in this parameter; and yet the approximation factor is close to exponential in this parameter. I insist that what is difficult here is to show that if you take your POMDP, then even if there are few reachable belief states, the approximation error can be large. Of course, the statement holds assuming P is not equal to NP.

Okay, so let me try to explain a bit how it works. We reduce from 3-SAT, so we start from a 3-SAT formula which has a given number of clauses m and a given number of variables k, and our parameter n will be the product of those two values, so n = mk. Recall that the formula phi is satisfiable if for every clause you can choose a literal which will be set to true, such that there is no conflict between the literals you have chosen to set to true: if you take two different clauses, and take l_i here and l_j there, then there is no conflict between those choices.

So what will the partially observable MDP, the game, look like? Well, first of all, there is a random choice of the variable you will be monitoring, and then you will be punished each time there is a conflict for this variable, where being punished means being forced to make a request, so increasing your cost. The reduction ensures that if the formula is satisfiable, then the value is smaller than n, whereas if it is not satisfiable, then the value is very big, greater than some exponential in n.

Just a very brief hint about the reduction. We start in some initial state, and as I said there is a random choice of the variable you will be monitoring; so here — I don't know if you can read this — these are c_{1,1} through c_{1,i} up to c_{1,k}, and remember we have k variables. Basically, this c_{1,i} means: I will monitor variable x_i, and I will check that you don't at some point say that x_i holds and later that x_i does not hold. So now, the nice thing would be to say: if at some point you set x_i to true, which is represented here, I will store this information in a state, so that I remember you said x_i was true. The same if you say that x_i is not true: I store this information. And if you say something about a variable which is not of interest to me, about some x_j, then I just don't assign a value to x_i, and we continue like this. If at some point there is a conflict — we are in a state where we remember that x_i is true and you tell me x_i-bar, so x_i is not true — then we go to this test gadget, where you need to perform a request in order to be able to try the whole process again. And — sorry, I did not mention it — if no conflicts are detected, then you reach the goal state here, and you are fine.

So this is all very fine, but the problem with this reduction is that you have too many reachable belief states. In order to get few reachable beliefs, we blur the information. What we say is that instead of going, when I set x_i to true, to this state where I store that x_i is true with probability 1, I will go there with probability 1 minus epsilon, and to the other two states with probability, say, epsilon over 2. In this way the controller is not able to know which state here is reached, so the belief states will be the same whatever choices are made. Okay, so this was a very brief idea of how it works.
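Schematically, and with the caveat that the bound for the unsatisfiable case was only described as "some exponential in n" (the constant c below is my placeholder):

```latex
\varphi \text{ satisfiable} \;\Longrightarrow\; \mathrm{val} \le n,
\qquad\qquad
\varphi \text{ unsatisfiable} \;\Longrightarrow\; \mathrm{val} \ge c^{\,n}
\ \text{ for some constant } c > 1,
```

which yields an approximation gap close to exponential in n, even though the game and its reachable beliefs are only quadratic in n.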
Let me jump to the conclusion. What we have looked at is minimizing the number of requests for full information in POMDPs, where the objective is to reach a goal almost surely. For the worst-case cost, you can compute — sorry, this should say EXPTIME — you can compute in EXPTIME the optimal worst-case cost, so it is polynomial time in the number of beliefs, which is exponential in the number of states, excuse me for that, together with an optimal strategy. Whereas for the average cost, it is undecidable: you cannot compute the value, and you cannot even approximate it, and we somehow characterized the size of the least approximation factor in terms of the size of the model we start with.

There are a few future works that could be investigated. The first is that here we have a very simple framework where you request full information; it could be interesting to have different levels of information, with different costs for requesting these different levels, and see what happens there. Another question would be to ask not for reaching the goal almost surely, but maybe with a slightly lower probability, while allowing for fewer requests — so to find a trade-off between the objective, the reachability probability, and the cost. Thank you for your attention.