And here we are. Welcome back. The plan for today is to lay out the foundations of partially observable Markov decision processes, so that tomorrow in the tutorial we can solve one specific example: the problem of Bernoulli bandits in a Bayesian setting. Let me recap the ideas and objectives for today and tomorrow. As you remember from the very beginning, any reinforcement learning problem can be cast as an agent interacting with an environment, in a closed loop made of actions and percepts. The percepts are split into a pair of objects: rewards and observations. Rewards are the part of the percept that pertains to the goal; we construct our goal starting from rewards. Observations give contextual information. In general, observations are limited: the state of the environment is often unknown, hidden from the agent, who knows its environment and its context only through partial observations. So this is a very general setting in reinforcement learning: what the agent has at its disposal is a series of actions, rewards, and observations. At time t of this interaction loop, the agent has performed a series of actions. Starting from the initial time, it has performed action A0, which has resulted in a reward R1, possibly a random object, and an observation Y1. This Y1 can be a high-dimensional object: observations could be images of the environment, or long lists of variables, temperature, pressure, the degrees of freedom of a physical system with n particles. The observation can be a very large object. The reward, in turn, is just a one-dimensional object, a real variable. Here we always discuss discrete approximations in order to keep the notation and the algorithms clearer, but the general reinforcement learning problem addresses the situation where these objects are real-valued. After that, a new action is taken, which may and should depend on the previous things that have happened, and then another sequence of rewards and observations, until at time t minus 1 we receive the last reward-observation pair. The question is how to choose, according to some time-dependent policy in general, the next action given this whole string of observations, which we may call a history. We may denote it h less than t, just a symbol for everything that has happened before time t; another notation is h of everything that has happened from time 0 to time t, whatever. These are just ways to compress the notation for a long string of objects. In general, the problem of reinforcement learning is to find the best mappings between histories and decisions, given that we want to achieve a certain goal. And in general the goal is to maximize, over these choices of policies, the expected value of, for instance, the sum of the future rewards starting from time t plus 1; this is one of the possible goals. The expectation is over all the randomness that can appear, so random rewards. This is the general task, which is extremely complex in itself. Here we face a first dichotomy between two possibilities. The first possibility is that we do not have a model available, so we have to rely on this long record of histories.
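In symbols (my own transcription of the notation just described, with A for actions, R for rewards and Y for observations; the exact symbols are not the lecturer's):

h_{<t} = (A_0, R_1, Y_1, A_1, R_2, Y_2, \ldots, A_{t-1}, R_t, Y_t),
\qquad
\max_{\pi}\; \mathbb{E}_{\pi}\Big[\sum_{k \ge 0} R_{t+k+1}\Big].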
So, first option: no model. Then we just have to take these long histories and process them somehow, compress them, put them into some memory. The idea is that in this loop of operations, the history at time t is compressed into a smaller set of variables, memory states. You might think of a physical computer memory: instead of keeping these exponentially large records, you compress them by some method, or erase everything that happened more than 10 steps before, or filter something out. Any kind of processing that compresses your information and puts it into a memory; and then, from the memory, you decide which action to take. This is one possible way to go, and it is the kind of thing we will explore next. But for now we are concerned with the second option, where a model is available, so we are talking about model-based decision making. What does it mean in practice for us? It means that we know there is an underlying Markov decision process. We know that the states of the environment depend on the actions and on the previous states in a Markovian manner, and we know how the rewards depend on the triplet of state, action, and next state. But the states are hidden from our knowledge. States are hidden, which means that instead of states we have to use observations, and these observations might depend, for instance, on the current or on the subsequent state; it doesn't really matter, it's just a matter of notation and definition. So we also have a model of the observations. If these three things are known, what we can hope for is to still be able to plan into the future, accounting for the fact that we don't observe the system itself. Why is this a challenge? To fix the ideas, although I insist that the whole treatment is rather generic and applies to any kind of system, let's look at a specific example: the two-state Markov decision process. We have two states and just two actions. I could give names to these actions, but that doesn't really matter here. Each action can have only two possible outcomes: either it sends the state back into itself or it sends it to the other state. And each of these arrows carries a probability and a reward. As usual in this notation, this would be the probability of going to state 2 given that the previous state was 2 and the action taken is 1, and this would issue a reward with previous state 2, action 1, new state 2. So the arrows are labeled by s, a, s prime, each with the quantities it carries. If you have the system as an MDP, you now know all the machinery, all the weapons you have at your disposal to solve this problem. You know all these quantities, so you can plan, use dynamic programming, and, for instance, end up with the final result that your best action from state 2 is action 1 and your best action from state 1 is action 2. Just let's assume this: there is some structure in these probabilities and rewards that makes the problem such that the best action from state 2 is 1 and the best action from state 1 is 2. And now I want to introduce partial observability, which means that, as an agent, I am not able at each step to know whether the system is in state 1 or in state 2.
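Before adding partial observability, here is a concrete sketch of the fully observed two-state MDP and its solution by value iteration. The numerical values of the transition probabilities and rewards are hypothetical, chosen only so that the example runs; they are not from the lecture, so the resulting optimal actions need not be the ones mentioned above.

import numpy as np

# P[a, s, s'] : probability of landing in s' from s under action a (hypothetical numbers)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])
# R[a, s, s'] : reward issued on the transition (s, a, s') (hypothetical numbers)
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 1.5], [0.5, 0.0]]])
gamma = 0.9

# Standard value iteration: with full state observability this is all we need.
V = np.zeros(2)
for _ in range(1000):
    Q = (P * (R + gamma * V)).sum(axis=2)   # Q[a, s] = sum_s' P(s'|s,a)(R(s,a,s') + gamma V(s'))
    V_new = Q.max(axis=0)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=0)                    # best action in each state
print("optimal values:", V, "optimal actions:", policy)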
So what I have instead is an observation at each time. There is another variable y, which can take, for instance, values 1 or 2. This variable, loosely speaking, is a measurement of the state, but a measurement which can be noisy. What do I mean by that? I mean that I'm introducing, for instance, this very simple observation model: the probability of observing y given s. First, if s equals 1, I observe y equals 1 with probability 1 minus epsilon, and y equals 2 with probability epsilon. What does that mean? It means that if my system is in state 1 and I use some device to measure which state I am in, this device tells me with probability 1 minus epsilon that I am in state 1, which is correct; but with some probability epsilon it makes a measurement error and incorrectly tells me that I am in state 2. This error probability is something that I know: I know how errors happen in my measurement device. And given this knowledge, the fact that I have a model of the observations, I nonetheless want to be able to control my system, to find out which are the best decisions to make even in the presence of this uncertainty. Is it conceptually clear so far? Any questions? Please stop me at any time if you have a question. Sorry, I have a very small thing about the part before, about the notation of the history: when you wrote h from 0 to t, is that right? Because when we evaluate the policy at time t, we take all the history until t minus 1. OK, I see what you mean, you're right: to be consistent I should put t minus 1. You're totally right, thank you. No problem. Yeah, I have another question. Yes, please. We used to say that a Markov decision process is a memoryless process, but now we are introducing this notion of history, which affects how we choose our action. Yeah, right, and that's exactly the point. When we have partial observability, we need to move away from the Markovian setting. There is an underlying system which is Markovian, but since we do not observe the states themselves, only some projection of the state space if you wish, we have to deal with the fact that our decision making will not be Markovian: we will have to rely on the history. The example you must keep in mind is the cart and pole: if you have access only to positions and angles, you need to rely on the history in order to control the system. It's just the same idea. And the cart-pole system, in the coordinates x and theta, is of course not Markovian, because knowing the position alone is not enough to predict the future. These are all very good points, and I'm happy you react to these conceptual issues. All right. And the same happens for state 2, only with the situation reversed: it makes the same kind of error. It could be a different error probability, it doesn't really matter; this is just to fix the ideas. Sorry, I was writing the same thing again, which is not what I wanted. OK. This sounds reasonable. What is the problem now? The problem is: how do I decide? If I don't know the states, how do I decide? I might solve my MDP problem, which tells me: if you are in state 2, choose action 1; if you are in state 1, choose action 2. But if I'm not sure about which state I'm in, what should I do?
So the intuitive idea is: I have had some observations so far, and these observations are actually telling me something about where I am, even if not with certainty. I know the model; if epsilon is very small and I receive an observation 1, or a series of observations 1, I can be pretty confident that I am in state 1. So how do I formalize this idea that the observations tell me something about the states, even though I don't observe them directly? If I can manage to formalize this idea of inferring the current state from the series of past observations, if I can perform this inference step, then I can decide, I can control, I can plan. That's the underlying idea. So that's what we set out to do today: define the formal framework that integrates sequential inference, inferring states from observations, with decision making. Just to sketch the conceptual plan, let's review a very basic description of what the problem is. A Markov decision process, broadly speaking, is a controlled Markov chain: you have a Markov chain, a process that goes from one state to another, and you have a set of parameters, the policy, which lets you control this Markov chain, just like a dynamical system where you have some forces you act on in order to steer the system in one direction or another. What we are going to introduce now are partially observable MDPs; that's our goal. And what are they? They are controlled hidden Markov chains. That's the parallelism: in the first case you had a Markov chain and you wanted to control it; now you have a hidden Markov chain, an underlying Markov process which you cannot observe directly, only indirectly, and you want to control it. Is that clear? Does it make sense? OK. So in order to set the stage, let's work with hidden Markov chains first. Hidden Markov chains are also known as hidden Markov models, an acronym (HMM) you might have encountered, with widespread applications; as you can realize, all the things we are discussing are extremely general. So what's the idea of a hidden Markov model? Here there is no control. The general idea is that there is some Markov system, a system which evolves in time according to some transition probability P from one state to another. There are no actions yet: we are not controlling the system, it evolves of its own accord according to some P which we know. This is the model of the environment without the actions. And at each time step, for simplicity, we receive some observation about the system. In our previous example with the two states, it is as if the system were jumping from one state to the other according to some probability. Let me redraw it here. Now there is no control, there are no actions; the system can only stay or jump, with some probabilities. This, for instance, is the probability of staying in 2 given 2, and this is the probability of going to 1 given 2. All these quantities are known. What is not known is the states themselves. So, as before, we make observations which can be error-prone.
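A minimal sketch of this two-state hidden Markov model in code, assuming illustrative numbers for the transition matrix and for the error probability epsilon (neither is specified in the lecture):

import numpy as np

rng = np.random.default_rng(0)

eps = 0.1                           # measurement error probability (assumed value)
# T[s, s'] : probability of jumping from state s to state s' (assumed values)
T = np.array([[0.7, 0.3],
              [0.4, 0.6]])
# F[s, y] : probability of observing y when the hidden state is s
F = np.array([[1 - eps, eps],
              [eps, 1 - eps]])

def simulate(n_steps, rho0=np.array([0.5, 0.5])):
    """Sample a hidden state trajectory and the noisy observations it emits."""
    states, obs = [], []
    s = rng.choice(2, p=rho0)
    for _ in range(n_steps):
        obs.append(rng.choice(2, p=F[s]))   # noisy measurement of the current state
        states.append(s)
        s = rng.choice(2, p=T[s])           # Markovian jump to the next state
    return states, obs

states, obs = simulate(10)
print("hidden states :", states)
print("observations  :", obs)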
So at every time we get a 1 or a 2, which might or might not be the current state we are in. In the limit where the error probability is one half, we get random sequences of numbers, which means we are not actually observing anything. So with this epsilon we can interpolate between a situation where we have absolutely no information about the state and a situation where we know perfectly, at each time, which state we are in. Like I said, a very important and relevant piece of information would be the probability of being in a certain state at time t, given the previous history of observations. Is the notation clear here? By this I mean a shorthand for the observation at time 0, the observation at time 1, up to the observation at time t minus 1. This is going to be very useful, because as I collect information about the system I might be able to localize my probability around a certain state, and if I have this knowledge, at least in a probabilistic sense, of where my system is sitting at a given time, I can formulate some sort of decision-making problem. The technique we are going to use, just to make a connection with things you may have encountered before, goes under the name of Bayesian updating. If you have already heard about it, we are just going to revisit the subject from the specific viewpoint of Markov models; if you are new to it, never mind, we will go through it in detail. The key object we are looking for is this one, and we are now going to go through a series of steps, slowly, in order to derive a rule which allows us to construct these probabilities sequentially. The basic idea is that as the system evolves in time, at every step we can incorporate new observations as they come and update our current probability over states. That's the goal. So let's go one little step at a time. What is this probability here? Allow me first a slight modification, sorry for this: let me define it up to time t, so the conditioning includes the observation at time t as well. Should I use a round parenthesis instead of a square parenthesis on the right, so that it reads like an interval? It is like an interval in which you include the endpoint t; I'm using square brackets because that is the customary notation for discrete intervals. So this means I'm evaluating the probability of being in this state after I make the observations; it is just a one-step shift with respect to the previous definition, you can find both in the literature, and it doesn't change the substance. So what is this thing? Just by definition, if I marginalize over all the possible states up to time t minus 1, it is the sum, over all possible sequences of previous states, of the probability of the state sequence from 0 to t given the observations from 0 to t. I'm just saying that I focus only on the probability of the current state: I don't care now about what the previous states were, given my information. In general, Bayesian updating deals with more complex objects, and we could reconstruct pieces of the history as well, but here we focus only on inferring where I am now, given the history. Is this notation clear? The sum below here means that I'm summing over s0, s1, s2, up to s t minus 1. Let me make it even more explicit in the corner.
So for future reference, this corresponds by definition to a sum over s0, a sum over s1, up to a sum over s t minus 1: I'm writing the marginal as an explicit summation over all the variables I'm not caring about. Why do I do that? Because now I can use the definition of a conditional probability, which tells me that this object is also equal to the sum over all previous states of the joint probability of states and observations, divided by the probability of the observations. It's just the definition of conditional probability applied to a string of values; nothing is happening here except simple manipulations of the properties of probability distributions. So why did I want to get to this ugly object? Because now I can use the Markov property of my system: I can write the joint probability of states and observations as a long product. Let's isolate this joint term for a second and write it down explicitly. The probability of a string of states and observations, what is it? You see how the process works: at the very beginning you have a state, then you make an observation, then you jump to another state, then you make another observation; and what governs the observations, which I didn't draw here but is worth putting, is the probability distribution of y given s. So let's start from the first item. First and foremost there is my probability distribution, let's call it rho-naught, of the initial state; I have to start somewhere, and this is my distribution over states at the initial time. Then I make an observation, which I'm going to call y0, and this depends on the state s0. Then I make a transition to a new state s1 given s0, and then I make an observation y1 in state s1, and so on and so forth. There is this long product of transitions, until I get to state s t from state s t minus 1 and make the observation y t from state s t. This is where I use the Markov property of my underlying process: the joint probability factorizes into a product of transition probabilities, and the observations are conditionally independent given the states. That's what I'm using here. This is interesting and useful because it allows me to rewrite this object in the following way: I can isolate what has happened at the last step. Therefore this is also equal to the probability of everything that has happened up to the previous time, t minus 1, times the last step. Everything is plain; here is where I started using assumptions about my system, namely Markovianity. Very good. Now what I'm going to do is use this object and replace it in the numerator. But before that, I also have to manipulate the denominator a little, which is nothing but the marginal over the states of what I have up above. So I'm going to do the same trick: the probability of the sequence of observations up to time t is just the marginal, over all sequences of states, of the joint probability. Again nothing to see here, just marginalization. But then I use the factorization property above, and I can split this sum into the sum over the previous history and the sum over the last step: the last step contributes the transition to s t times the observation probability. Now I can pull out of the sum over s t the factor that does not depend on s t, and I can marginalize.
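For reference, here is the factorization just used, in my reconstruction of the board notation (rho_0 for the initial distribution, P for the transitions, f for the observation model):

P(s_t \mid y_{0:t}) \;=\; \sum_{s_{0:t-1}} P(s_{0:t} \mid y_{0:t}) \;=\; \frac{\sum_{s_{0:t-1}} P(s_{0:t}, y_{0:t})}{P(y_{0:t})},

P(s_{0:t}, y_{0:t}) \;=\; \rho_0(s_0)\, f(y_0 \mid s_0) \prod_{k=1}^{t} P(s_k \mid s_{k-1})\, f(y_k \mid s_k)
\;=\; P(s_{0:t-1}, y_{0:t-1})\, P(s_t \mid s_{t-1})\, f(y_t \mid s_t).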
So the denominator becomes the probability of the previous sequence of observations times a sum over the states, which is not surprising, since my last observation depends only on the last state. Sorry, here I forgot one thing, let me rewrite it properly: the sum is still over the last two states. So I have the probability of the previous history of states conditioned on the previous observations, times the probability of those previous observations, times the transition to s t, times the probability of the observation y t given s t. This one is correct. The probability of the previous observations is the factor that can go out of the sum, and the remaining sum splits further into the sum over the states up to time t minus 2, the sum over s t minus 1, and the sum over s t. The joint over the previous history I split into the conditional on the observations up to time t minus 1 times the probability of those observations, then I sum over the histories up to time t minus 2, and I finally end up with this. OK? It's a lengthy and somewhat cumbersome operation, but I hope the spirit is clear: because the process is Markovian, you can peel off the last thing that happens and sum over everything that happened first. So the reading of this formula is: remember where we started from? We wanted the probability of making a sequence of observations up to time t. Well, this is telling me that it is the probability of making the observations up to time t minus 1, this term; then, given that history up to time t minus 1, there is a conditional probability of being in state s t minus 1, which is this object; and given that, I jump to state s t and make observation y t according to this formula. I am unrolling all the conditional probabilities in this scheme. And if I take this and combine it with the green formula above, I have this term at the numerator and this term at the denominator, which together will give me the conditional probability of the past history with respect to the observations. So when I combine the green and the yellow terms together, what do I get? Excuse me, can I ask a question? Sure, please. In the orange passage, going from the first line to the second one, we wrote the same thing but we added a new factor, the probability of y from 0 to t minus 1, and I didn't understand why. Because it's not the same thing: we are using exactly the green formula here. We write this as the product of what has happened up to t minus 1 times the last step, and then the sum over the states from 0 to t we split into the sum from 0 to t minus 1 and the sum over s t. Is that clear? Yes, but in the following passage we write the same thing as above, and we added the P of y. Here we are just saying that the joint probability of states and observations is the probability of the states conditional on the observations, times the probability of the observations. Ah, it's the definition of the conditional. OK, OK, sorry. No, it's fine; it's very good if you ask specific questions about the passages. Actually, I have another question: in the last passage, we seem to be erasing the reference to s from 0 to t minus 2, and I didn't understand why.
Yeah, we are summing over all possible occurrences: we are not erasing, we are actually performing the sum. If you look at this sum here, this is the probability of the last step, t minus 1, and all the previous steps, and now we are summing over all of them, so we are taking the marginal: this object is the marginal of the one above. OK, thank you. So yes, there is a lot here, but you can review it individually or ask me questions; I'm very happy to guide you through the details. The basic take-home message is that we are just using properties of probabilities, that is, how conditional probabilities and marginal probabilities are defined, and we are using the Markov structure of the problem. There is no other ingredient we are putting into the game at this stage. Sorry, I have another question about that, a little bit further up: what is the notation in the penultimate passage, where there is a P of s from 0 to t minus 2 and then s t minus 1? Is that just for isolating the last element? Yes, let me make it a little more explicit. This object here, remember, is by definition a sequence: state 0, state 1, up to state t minus 1. What I'm doing here is just writing it as the states from 0 to t minus 2 together with the state t minus 1. So it's just a way to represent the sequence with the last element isolated: the big string plus one item. Exactly. OK, thank you for asking, so everything is transparent. All right, so that's where we were: we have to combine these two things together. If we do that, the numerator is the sum, over the states from 0 to t minus 1, of the joint probability of the states up to time t minus 1 with the observations from 0 to t minus 1, times the observation probability of y t given s t, times the transition from s t minus 1 to s t. So this is the numerator; sorry for the ugly use of space, I'm struggling to make this fit properly, which clearly shows that this thing is not a real blackboard. The probability of y I'm just copying from above, and then we're almost done, because there are two last steps to be taken, and it's always the same kind of game we're playing. Let's collect this term with this one: it is the ratio of a joint probability to a marginal probability, so it is a conditional probability. This means that I have the conditional probability of the history up to time t minus 1, times the transition, times the observation, divided by what is left from the denominator, the sum over s t and s t minus 1. And then, the very last step: this sum I split once more into the sum over s from 0 to t minus 2 and the sum over s t minus 1, and the first sum I can marginalize, just like I did before: I sum over everything that has happened before time t minus 1.
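The resulting expression for the denominator, as I reconstruct it from the derivation above, is:

P(y_{0:t}) \;=\; P(y_{0:t-1}) \sum_{s_{t-1}} P(s_{t-1} \mid y_{0:t-1}) \sum_{s_t} P(s_t \mid s_{t-1})\, f(y_t \mid s_t).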
So all of this allows me to get to the final formula, which is: the sum, over the states at time t minus 1, of the probability of the state at t minus 1 given the history up to time t minus 1, times the transition probability, times the observation probability; while writing I realize that I mixed up t plus 1 and t here on the board. And this is divided by the sum over s t and over s t minus 1 of the same object. And we're done, because, going a long way back up, what was on the left-hand side of all this? On the left-hand side we had the probability of being in state s t given the history of observations from 0 to t. So all of this is equal to the probability of being in state s t given the history from 0 to t. So why are we happy now? Because this gives us a sequential rule to update our probabilities. If I have my probability of being in a certain state given the history of observations up to that time, I have it here and here, I can combine it with my transition probabilities and with my observation probabilities to get the new, updated probability. So I can proceed in time sequentially: if I start out with a certain probability distribution over states, I keep including each new observation, and so on and so forth. But what is this? Well, this is essentially Bayes' formula; this is the essence of Bayes. So what is the difference here? Let's try to recognize what the terms are and how they differ from the most familiar form of Bayes' formula you might know. This object here is what I would call the posterior at time t, this one here is the posterior at time t minus 1, and this term here is the likelihood. The only new thing with respect to the usual Bayes formula is that now you have the transition term. So, summarizing: if at any time you have the probability of a state given the observations up to time t minus 1, you can use your knowledge of the model, P and f, to derive the new updated probability given the new observation. Just to make things more familiar, let's consider one simple case: the transition probability is just the identity, so your underlying Markov system is actually not changing state, it is always sitting in the same state. In this situation, you see that you can carry out these sums explicitly, because the transition probability here is an identity. So in this example, what does the formula become? It becomes: the probability of being in state s (there is just one persistent state variable, so I don't even need a time index) given the sequence of observations up to time t is the probability of that state given the previous sequence of observations, times the likelihood of the last observation given the state, divided by the sum over s of the same thing. This is probably the version of Bayes' formula you are all familiar with: there are priors, likelihoods, and below you have what is called the evidence, the normalizing factor for the numerator. So what has changed here is that, since the system can change from one state to another, your prior also moves in state space. If you are 99% sure that at a certain time you are sitting in state 2, and your model says that you now make a transition to state 1, then your belief will move with you.
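In symbols, the sequential update just derived (my transcription of the board formula, with the time index written consistently) is:

P(s_t \mid y_{0:t}) \;=\;
\frac{f(y_t \mid s_t) \sum_{s_{t-1}} P(s_t \mid s_{t-1})\, P(s_{t-1} \mid y_{0:t-1})}
     {\sum_{s_t} f(y_t \mid s_t) \sum_{s_{t-1}} P(s_t \mid s_{t-1})\, P(s_{t-1} \mid y_{0:t-1})},

and in the special case where P is the identity (the hidden state never changes) it reduces to the familiar Bayes rule

P(s \mid y_{0:t}) \;=\; \frac{f(y_t \mid s)\, P(s \mid y_{0:t-1})}{\sum_{s'} f(y_t \mid s')\, P(s' \mid y_{0:t-1})}.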
So if you are sure that you are in state 2 and your model tells you that you have now jumped to state 1, then you are now sure that you are in state 1. This expresses the fact that there are two competing parts in the way beliefs, that is, posteriors, are updated. Posterior updates are made of two steps. First, likelihoods: likelihoods increase our information about the state, so useful observations will tend to shrink our posterior distribution; the more we know, the more the belief shrinks. But then there is also the transport of the posterior: transport means that if the states change, the posterior follows this change, and since the system is stochastic, this tends to spread the posterior out. For instance, suppose you currently have a very narrow posterior on state 2, so you are quite sure you are sitting in state 2, but your model sends you with probability 60% into state 1 and with probability 40% back into state 2. So I am in a situation with two states, 1 and 2, a probability 0.4 of staying here and 0.6 of going there. Now, even if my posterior of being here at a certain time t, given the history up to that time, is, say, 0.999, when I make this transition step a lot of information is erased: after the step I will know only that I am in state 1 with probability approximately 60% and in state 2 with probability approximately 40%. So in general there is this competition between the observations, which increase information, and the dynamics of the system, which tends to make me forget. Fine. So this notion of belief updating, which is a glorified version of Bayes' formula that also accounts for the underlying dynamics of the hidden Markov process, is the key ingredient in what we are going to do after the break, that is, construct a theory for decision making in the presence of partial observability. Any questions so far? Yeah, if I can: is it correct to write, maybe it's a mistake, I don't know, P of s t given s t plus 1 in the last passage? Yeah, you're right, I've been making some mistakes with the pluses and minuses; thank you for noticing, and there might be other ones. When I upload the file, if you notice other misprints please let me know so I can correct them. Thank you for being so attentive. Sorry, someone is speaking, but I can't hear you; the volume is quite low, at least on my side, could you please raise it? No, it's not getting any better for me at least. OK, maybe now it's better. That's better, thank you. I was saying that maybe the t plus 1 came from the initial definition in the green formula, when we unrolled all the terms; maybe it was there. Well, I tend to say that the t plus 1 comes from the fact that I'm an old guy getting more and more Alzheimer's over the days. The confusion arises from the fact that, in my mind, you could take the step from t to t plus 1 or the step from t minus 1 to t, and if you mix the two together it makes a mess. Yes, of course. I tend to favor that hypothesis. There should be no t plus 1 if I do everything correctly from the beginning to the end, so if you spot one, it's most likely a mistake of mine.
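To make the transport effect described above concrete, here is a tiny numerical check (assumed setup: a two-component belief vector, and a transition row [0.6, 0.4] for state 2, as in the example; the row for state 1 is an arbitrary assumption):

import numpy as np

# belief over (state 1, state 2): almost certain we are in state 2
b = np.array([0.001, 0.999])

# T[s, s']: from state 1 we stay put (assumed), from state 2 we go to 1 w.p. 0.6
T = np.array([[1.0, 0.0],
              [0.6, 0.4]])

# transport (prediction) step: b_pred(s') = sum_s b(s) T[s, s']
b_pred = b @ T
print(b_pred)   # approximately [0.600, 0.400]: the sharp belief has spread out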
Sorry, I want to ask something. I was reviewing the files you gave us of the whiteboard, and I noticed that they are a bit blurred: you cannot always distinguish well what you wrote. I was wondering if there is a way to make them clearer. Yeah, I noticed that as well. I tried a couple of attempts but was not actually able to improve the resolution; I'm afraid that's the problem, but I will see if I can do some post-processing to make them better. Thank you very much. I will try. OK, good. I think it's been a long ride, so let's take a break and reconvene in, say, 20 minutes. Sorry, can I ask another question, if possible? I didn't understand what happened in the pink passage. This one? No, above. This one? There's a few pink ones; I think there is a bit of delay in the screen. Below. Below again. That one? Yes. OK, so what I did here is repeat the same kind of thing I did earlier. This sum here I split into the sum up to time t minus 2 and the sum over t minus 1, which is this. Is that first part OK? And for the second part: this history, again, I'm doing the same thing I did above, also in pink, and I notice there is an error here, this should be minus 1. So this whole sequence you split into the first part of the string and the last item; I'm writing this object as the states from 0 to t minus 2 together with s t minus 1. And then, now in another color, this bright yellow part of the sum I use to sum over all the history up to and including t minus 2, so this leaves me with just the dependence on this last item: I'm marginalizing over all the states strictly before time t minus 1. Is that any clearer? Yes. OK, thanks. Another thing: among the things we take into account from our past observations, why didn't we also take into account the actions that we made and what happened? Right. Because, like I said at the beginning, we are going to do things one at a time, in the sense that for the moment we start without control: this is just the hidden Markov model without control. In the next part we are going to add the actions and everything else; the next step will be what happens if I make an observation and an action, and I will combine these two things together. So you are perfectly right; at this first stage, it is already pretty cumbersome, and adding actions as well would have made it more complicated, so in my mind it was a way to simplify, to split the process: first what happens without control, then we add control. Does that make sense? Yes, yes, thanks. Sure. OK, let's make it 10:25 then. See you later. OK, here we are. Now we lay out all the ingredients to formally define what a partially observable Markov decision process is. For a little while I will just introduce this at the level of definitions, and then we will see how to work it out and what to do with it. Like I told you earlier, a partially observable Markov decision process is an extension of the notion of a Markov decision process, so it should include the same ingredients as an MDP plus some others, and some of the ingredients are already familiar to you: there should be states and actions.
A new thing with respect to the Markov decision process, something we have already been discussing extensively, is that now we have another set of objects, the observations; or, more generally, you may also think of them as the percepts, if you want to include the rewards. So the percept y could also be a pair of a reward and a contextual observation. You can find both definitions in the literature: rewards can be seen as information about the state of the system themselves, so depending on how you formally define the system, you could think that rewards are functions of observations, or keep rewards and observations separate. It doesn't really matter which definition you use; the important thing is to keep the concept in mind. Then there are other things which are again customary from MDPs, in the sense that you have your model of the environment, that is, the transition probabilities from states to other states given actions, and you have your average rewards. Here you could generalize a bit, but let's keep the notation as close as possible to MDPs: these are the average rewards. Then of course, since you have observations, you have to provide a model for the observations as well, that is, how you expect the observations to be distributed in probability. In the specific case of POMDPs, the notation that is used is the following: the observation depends on the action you take and on the new state, the state you land on. Again, you could define it otherwise, depending on the previous state or on both states; this is the most used notation, so I will stick to that, but these are minor changes, and every problem can be mapped into another by a suitable change of definitions, so you shouldn't be worried about it. This, quite simply, is the observation model. Good. A new thing we have to introduce at this stage is the notion of the posterior. We have seen so far that the posterior can be regarded as a way of encoding the history of observations. In mathematical terms, you can actually show that the posterior is a sufficient statistic for the history, which means that all the information accumulated in a sequence of observations and actions is encoded in the belief: you are not losing any information by using the posterior rather than the full history. I keep using the word belief because this is what priors and posteriors are called in the POMDP literature: b of s is called the belief, and it is a probability distribution over states. We are going to routinely call these beliefs; what you should keep in mind is that they are the posteriors under Bayesian updating. And in fact this is made clear by the fact that we give a rule for how to update beliefs, which is the following one; let me write it and then we comment on it. This object here is the updated belief, that is, the posterior, given a certain action taken and a certain observation made. My system goes from one state to another due to some action, and in the process some observation is made; as a result, I update my belief about where I am in state space, about which state I am occupying at the current moment. And this is just a reflection of what we did in the previous hour.
It's just the likelihood of the observation at the new state, this object I introduced here, and then there is a sum over the previous belief, the prior if you wish, transported by the process; and all of this has to be normalized by the sum over s prime. So this is exactly the same formula that we wrote previously for Bayesian updating, only with a slight change of definitions and with the fact that we have now introduced actions. And now the key point is that in POMDPs the policies, since we cannot define them as mappings from states to actions, are mappings from beliefs to actions: a policy is a function from beliefs to actions. By this b with a dot I mean the full vector of beliefs; the policy is a function of the vector of beliefs. This is a conceptual point which has to be extremely clear. The idea of a POMDP is that I start with some prior about where I am in state space; for instance, in my two-state model I may say I start with 50% probability on each state, this is my prior. Then I have a policy, any given policy, which says: for that belief, for that prior, you should pick action a with a certain probability, for instance 50% for each action in our simple model. Then, once the action is performed, the decision maker, in his mind, says: if I took this action and made this observation, this would lead me to a belief b prime. And at the new belief, what do I do? I again consult my policy and extract a new action a, and so on and so forth. That's the idea. And what is the goal? The goal is the same: maximize, over the policy, the expectation of the discounted rewards; let me write it more explicitly so there is no confusion about what I mean. But here there is one thing that has changed with respect to the previous definition: formally it looks the same, but the important point is that this average, this expectation value, is also over the beliefs. We are not averaging only with respect to the stochasticity of the process, like in an MDP; we are also averaging over our beliefs, over how we infer that we are distributed over the states. Since this is important, let's open a little parenthesis to see what this actually means. Sorry, can I ask one small thing before we go ahead? Before, when you were writing the observations, you wrote that y is equal to r and o, further up. Oh yes, this is notation: it's a small o. What is it for? It means that, if you want, you can explicitly split your observations into rewards and contextual observations. It's just to tell you that sometimes in the literature, if you happen to read a paper on POMDPs, you will find this other kind of description. The literature is not homogeneous about how to represent things: some people say there are observations and rewards are functions of observations, some say there are percepts, which are a mix of rewards and contextual observations. It's not very important, but keep in mind that there might be slight differences in notation and definitions; nothing is as crystallized as it is for MDPs, so there may be slight variations in notation from one approach to another, but the substance is the same. I don't want you to get too confused about notation; focus on the principles. OK, OK, thank you. Sure.
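Here is a minimal sketch, in code, of the belief-update rule described above, gathering the POMDP ingredients just listed. The array conventions and the numbers are my own illustrative assumptions, not the lecturer's:

import numpy as np

eps = 0.1
# P[a, s, s'] = P(s' | s, a)   transition model
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])
# O[a, s_new, y] = O(y | a, s_new)   observation model for the landing state
O = np.array([[[1 - eps, eps], [eps, 1 - eps]],
              [[1 - eps, eps], [eps, 1 - eps]]])

def belief_update(b, a, y, P, O):
    """Posterior belief after taking action a and observing y:
    b'(s_new) proportional to O(y | a, s_new) * sum_s P(s_new | s, a) * b(s)."""
    predicted = b @ P[a]                    # transport the prior through the transition model
    unnormalized = O[a, :, y] * predicted   # weight by the likelihood of the observation
    return unnormalized / unnormalized.sum()

b0 = np.array([0.5, 0.5])                   # uniform prior over the two states
b1 = belief_update(b0, a=0, y=0, P=P, O=O)
print(b1)                                    # the belief concentrates on the state favoured by y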
OK, so just to make everything absolutely transparent, let's see what this expectation means in terms of averages over beliefs, starting with the first terms in the sum. This object here, if I make it explicit, is the expectation of the reward for s0, a0, s1, plus gamma times the expectation of the reward for s1, a1, s2, and so on and so forth. Let's write the first two terms explicitly in order to see where the beliefs come into play; it is already clear at the level of the first term. How do I construct this average? First of all, remember that there is some stochasticity in the process itself. What this object means is that I have a state, an action, and a new state. And how do I go from one to the other? Well, I have my policy, which depends on my initial belief, and this is the first place where the belief appears; and then I have the new state s prime, which depends on the model. Of course I have to sum over all possible actions I can take, with their own probabilities, and over all possible outcomes; but I also have to sum over s, because I don't know where I am, and this is the second place where the belief appears. So you see what I mean explicitly by this average: I am summing over my probability distribution over states, which is given by my starting belief, my prior; I pick an action according to that prior; and then I average over the transitions. The important message is that there is this additional level of averaging over the belief distribution. At the second step it is pretty much the same, only that now you use your updated belief. If I write it out explicitly, I again have rewards, states, and actions, only that now my policy is evaluated at the new belief, which depends on the action I took at the previous time, call it a0, and on the observation I made at the previous time, y0. So there is the policy, then my transition probability from state and action, then the probability with which the previous observation was made, and then my belief for the current state, which again depends on the previous action and observation. And I think I have everything, because I have summed over states, actions, and observations; and of course there is also the policy at the initial belief for a0. So there is already a lot of stuff coming into the game: this object has to be averaged over the belief at the second time, which itself depends on the action I took. The sum runs over y0, a0, s, s prime, and so on, over everything, because you have to come up with something that does not depend on the states. It is a very intricate sum, which of course makes the problem quite challenging at first sight. Never mind the details: these expressions are clearly intractable as they stand, and the only purpose of showing them is to make clear that there is a dependence on the history, through the beliefs, in all the things that we are going to sum up into the future. So one could take the approach of facing this problem directly and trying to solve it. Luckily enough, there is one important consideration that saves our day, so that rather than doing a ton of analysis and writing down pages of formulae, we just have to realize one fundamental thing.
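Before moving on, here is the first term of that sum written out in symbols (my reconstruction, using b0 for the initial belief and pi for the policy; the notation is not the lecturer's):

\mathbb{E}[R_1] \;=\; \sum_{s} b_0(s) \sum_{a} \pi(a \mid b_0) \sum_{s'} P(s' \mid s, a)\, R(s, a, s').

At the next step the policy is evaluated at the updated belief b_1 = \tau(b_0, a_0, y_0), so the second term already involves sums over a_0, y_0 and the new state-action pair, weighted by b_0, \pi(\cdot \mid b_0), P, O and \pi(\cdot \mid b_1).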
And that one fundamental realization will lead us immediately to the solution of how to solve a partially observable Markov decision process. Now, this is highly conceptual, so just stay with me and don't think about formulas; it's the idea that is important. An MDP, as you know, lives in a space of states, with actions that bring you from one state to another, et cetera; you have seen this many times already. Now, a POMDP lives in a new space, which is the space of beliefs. The space of beliefs is a complicated object; we discussed this when we were talking about policy gradients. It is a simplex, a sort of hyper-tetrahedron, a high-dimensional object which contains probability distributions, embedded in a real space whose dimension is the number of states. A belief is a point in this hyper-tetrahedron, a vector that belongs to it. The first key observation is that Bayesian updating, that is, this formula here, sends beliefs into new beliefs. What happens is that after you take an action and make an observation, your system jumps into a new belief b prime, and this new belief depends only on the action you took and the observation you made. If there are several observations that can be made, this just means that you can jump to different points, as many points as there are observations. So if you look closely at this, you realize that Bayesian updating is a Markov process in the space of beliefs. Why is that? Because the new belief depends only on the previous belief, on the action, and, probabilistically, on the possible observations, with probabilities that you know. So if you make the conceptual effort of replacing states with beliefs, of replacing the notion of a state with the notion of a probability distribution over states, lifting the problem to this higher-dimensional, actually continuous space, then a POMDP is equivalent to an MDP in belief space. Bayesian updating can be seen as a transition probability in belief space: you can define a probability of reaching a new belief given the previous belief and the action, and your average rewards become average rewards with respect to beliefs. Without working out all the details, you should just take my word for it, and I will point you to references where this is explained in full detail if you are curious. I hope I have convinced you that working with a partially observable Markov decision process is just looking at an MDP in a much more complex space, the space of probability distributions over states. Where do the two things come together? They coincide when your observations are perfect. What does that mean? If you observe the state directly, with certainty, your observations bring you to the corners of this belief space, and the corners of the belief space are certainty: you know that you are in state s. In that case, everything collapses back to an MDP. So this is the geometrical intuition telling you that this is an extension, and a proper extension: if you algebraically replace the observation probabilities with delta functions that say you are in that state, you just recover the mathematics of the MDP.
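For the record, the lifted objects described above can be written as follows (my notation, with tau denoting the belief update defined earlier):

\Delta(S) = \Big\{ b \in \mathbb{R}^{|S|} : b(s) \ge 0,\ \textstyle\sum_{s} b(s) = 1 \Big\},
\qquad
\Pr(b' \mid b, a) = \sum_{y} \Pr(y \mid b, a)\, \mathbf{1}\{ b' = \tau(b, a, y) \},

where \Pr(y \mid b, a) = \sum_{s} b(s) \sum_{s'} P(s' \mid s, a)\, O(y \mid a, s').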
All of this is lengthy, but I hope you trust me; and if anything is not clear at the conceptual level, this is the right time to ask. If you don't believe me, I take that as a perfectly good expression of skepticism, but I can point you to all the calculations that prove that what I'm saying here is not false. So what is the upshot of all this? The upshot is that there is a Bellman equation: just as we had a Bellman equation for the MDP, we can write down a Bellman equation for the POMDP. Let's first write it down, and then we comment on how nice it is and how horrible it is, because it is both things at the same time. Bottom line first, and then we discuss it. The optimal value of what? Well, it cannot be the optimal value of a state, so it is the optimal value of a belief. The optimal value of any belief means: if at a certain time I find myself with a certain belief about where I am in state space, I can compute the optimal gain I can get in the future given that belief. Again, remember, this is a problem of planning under uncertainty, so all of this happens in the mind of the decision maker before it even gets to know what kind of observation it will make; it's just planning. This is a little mind-boggling, but one always has to keep it in mind. Now, you recognize that the structure is extremely similar to the Bellman equation you are familiar with: it is the maximum over actions, and here you would have the probability distribution over new states and the sum over s prime. What changes is that you don't know the states, so you also have to put the belief here and sum over it; this is the transition averaged over the belief. And then there is an additional new term: there are also observations to be taken into account, so you have to average over them as well; this term is new with respect to the previous Bellman equation. Then you have something more familiar, r of s, a, s prime, plus gamma times the value of the new state, as in the Bellman equation, except that now it is not the new state but the value of the new belief: v star of b prime, given the action and the observation. You see, the structure is basically the same, with the only difference that now we have beliefs to average over and observations on which the transitions depend. The best thing I can do is the maximum, over actions, of the instantaneous reward at the next step plus gamma times the best I can do from the new belief onwards. The point is that, in principle, you could use value iteration to solve it. And here I immediately put a question mark, because here comes the bad news. The Bellman equation for an MDP, you might remember, amounts to finding a fixed point for a vector, because the value function is a vector. Now the value function is a function: the belief b is a vector, and the value function is a function of that vector, so the Bellman equation becomes a nonlinear functional equation. And nonlinear functional equations are hell or heaven depending on your tastes: mathematicians might thrive, and practitioners of computer science might die. In fact, you can prove, by complex mathematical arguments, that the complexity of solving this equation lands in a class which is called PSPACE.
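Putting the pieces together, the POMDP Bellman equation sketched above reads as follows (my transcription, with tau(b, a, y) denoting the belief update):

V^*(b) \;=\; \max_{a} \sum_{s} b(s) \sum_{s'} P(s' \mid s, a) \sum_{y} O(y \mid a, s')
\Big[ R(s, a, s') + \gamma\, V^*\big(\tau(b, a, y)\big) \Big].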
You might have heard about P and NP-complete problems; well, solving this equation is PSPACE-complete, which is entirely another championship, a completely different contest. So in general this is a super hard problem to solve, except in some very favorable situations. What are these favorable situations? First and foremost, the key step is that you have to know how to update the beliefs, so if there is any situation in which the belief updating is simple, maybe you are able to simplify your problem. So now I'm asking you one last effort: do you know of any example where this operation of belief updating (forget about the actions, just Bayes' formula) is simple to iterate, where you can go from prior to posterior easily? There are some conditions that allow you to do that efficiently. Any suggestion? You must rely on some structure; it's not something you can always do easily, but there are specific cases in which the task is tractable. I don't remember what it's called, but maybe it's when you combine the prior and the likelihood and you get back a distribution of the same kind, from which you can sample. Yeah, great: the name that is slipping your mind is conjugate priors. Yes, sorry, I had forgotten about that. There is a whole class of probability distributions with this nice property. Just to name one: Gaussians and Gaussians. If the likelihood is Gaussian and the prior is Gaussian, the posterior is Gaussian. Why is this good news? Because you don't need to work with the full space of probability distributions; it's enough to look at the space of the sufficient statistics of the distribution, that is, you just keep track of means and variances. This is a great dimensionality reduction of your problem. Of course, it works only when it works, only when your underlying model has these nice properties. That is one example. Another example: if your likelihood is Bernoulli, so just zeros and ones, heads and tails, what is the conjugate prior? Anyone remember by heart? Is it the binomial? No, not quite: the binomial is another distribution, and if you start from a binomial you will not end up with a binomial. I think it's the beta distribution. It's the beta distribution, exactly. So if you start out with a beta prior, which is a distribution over real numbers between zero and one, and you have a Bernoulli likelihood, then you end up again with a beta distribution. Actually the family is much larger than this, in the sense that if the prior is in the exponential family, which encompasses these examples and others, then the posterior is also in the exponential family, provided the likelihood has certain properties. So you can cover a large class, but that doesn't cover all probability distributions; this situation, remember, is non-generic. So in general, solving the Bellman equation for POMDPs is a hard task. Now, there are algorithms that do very well nowadays, but they combine heuristic ideas with direct solving. And just to mention the kind of ideas you need: remember, you want to do a sort of value iteration in this continuous space of beliefs. One thing you could try, which doesn't work, is simply to discretize the space; it doesn't work because the space is so large that you fail.
So you can do other things, like trying to find the relevant points by pre-sampling the space. There are many smart tricks that allow you to get approximate solutions of this problem, but this is really current literature, so we are not spending much time on it. What we are going to do instead, tomorrow, is apply this framework to the two-armed Bernoulli problem. The two-armed Bernoulli problem is very nice because the transition probabilities are simple: the state is always the same, which simplifies the problem and means we just have to use the usual Bayes rule. And the coins are Bernoulli, so we know that if we start with a beta distribution, we end up with a beta distribution. What we will see tomorrow is that this space of beliefs, which in general is a nasty, high-dimensional, continuous space, becomes a discrete space for the two-armed Bernoulli problem, and we can use value iteration to solve the Bayesian bandit exactly. That's the plan for tomorrow. If you want to read more about this, a good reference is a review paper by Spaan (maybe it has two authors) on partially observable Markov decision processes; I will post it in the Slack channel soon after we end the lecture. There you will see all these kinds of things laid out, together with some algorithms to solve POMDPs for a finite horizon, which was a sort of classical approach; these easily become very, very complicated in the general case. But tomorrow we will put our hands on a specific problem and see how it plays out. Any questions? So far so good; there's a lot to process, but hopefully tomorrow's example will make it clear. All right, then I think we are done here. Stop chatting and stop recording.
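As a pointer to tomorrow's tutorial, here is a tiny sketch of the two points above: the Beta-Bernoulli conjugate update is just a count update, so for the two-armed bandit a belief is a tuple of four integers and the set of reachable beliefs over a finite horizon is discrete. This is an illustrative sketch under my own assumption of independent Beta(1, 1) priors on each arm:

def initial_belief():
    # (alpha_1, beta_1, alpha_2, beta_2): success/failure counts (plus prior) for each arm
    return (1, 1, 1, 1)

def update_belief(belief, arm, outcome):
    """Pull `arm` (0 or 1), observe `outcome` (1 = success, 0 = failure).
    Beta(alpha, beta) prior + Bernoulli observation -> Beta(alpha + outcome, beta + 1 - outcome)."""
    a1, b1, a2, b2 = belief
    if arm == 0:
        return (a1 + outcome, b1 + (1 - outcome), a2, b2)
    return (a1, b1, a2 + outcome, b2 + (1 - outcome))

# After a finite horizon of n pulls there are only finitely many reachable count tuples,
# so value iteration over this discrete belief set becomes feasible.
print(update_belief(initial_belief(), arm=0, outcome=1))   # -> (2, 1, 1, 1)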