All right, so we spent the previous lecture setting the stage for how to tackle Markov decision processes, which, I remind you, are defined as the situation in which one has to optimize a certain goal, as usual the expected discounted sum of future rewards over some policy, given that everything is known about the environment. The agent receives as information the state of the system, so observation is perfect. The agent also has perfect knowledge of the laws of evolution of the system, that is, the transition probability from one state of the environment to another given a certain action, and of course the structure of the rewards. Given all this information, the decision-making problem boils down to a problem of computation, and I showed you that we can derive the Bellman equation in its softened version, then take the limit and obtain the usual Bellman optimality equation. I also showed you how in general one can tackle this equation for the value function or for the state-action value function, which is called the quality Q; either of the two works equally well, each with its own pros and cons, which we discussed yesterday. The basic mechanism by which you obtain the solution is an iterative algorithm. These techniques in general go under the name of dynamic programming, so if you stumble upon that term, that is what is meant by dynamic programming in this specific context. They also go under the name of value iteration, and there are several variations thereof, as you can imagine: there are techniques in which you work with both the policy and the value, so you start off with a guess for the policy and the value and you iterate, jumping from value to policy and back. We have been writing fixed-point equations which were well defined, so there are many ways of reaching the fixed point by iterative methods. Today we will consider a situation which is more interesting in practice, even though it has its own limitations: the situation where you do not get complete information from the environment, but you know the laws by which your environment reacts to actions, you know the structure of the rewards, and, as we will see, you also have a model for the observations. What is the physical analogue that we have in mind? Suppose you want to control a position. Say I have a piece of chalk rolling on the desk and I want to move it over here; then I have some control, some force that I can exert on the chalk to bring it here, while optimizing something, say the time that it takes to carry it over. Of course if I push too hard on one side it just slides off and never crosses the line, and so on. Or the cart-pole example that we were discussing, in which we had to move the cart in order to keep the pole vertical; that is another example. The difference here is that we do not have access to the full state space of the system. For instance, say we want to solve the cart-pole problem without having access to the velocities, just to the position and the angle. This may seem like an ill-defined problem, but that is not the case: you can control the system to some extent, although you will probably not be as effective as when you know everything.
But over time, as you accumulate observations, you may approach better behavior. So, in the diagram where we were discussing the precision over initial conditions, the quality of measurement and the quality of knowledge about the system, we are on the far right as far as knowledge of the laws that govern the system is concerned, but much lower as far as knowledge of the actual state of the environment goes. That is the general framework. Of course this is a limitation, because in most cases we do not know the laws by which a system evolves, unless the system is very well characterized, as in physics; in other situations you do not know the laws, and we will have to discuss that next lecture. For the time being, the problem is partial observability: we do not observe the state of the system in full, we only observe a subset of quantities, or we observe them with certain errors. That is the kind of situation we will deal with. This will be the first setting in which it is not only about computation, or not entirely: there is some notion of learning from experience, even though it enters in a very mild and moderate way. In order to set the stage for this discussion, let's start with a very simple diagram, the diagram of a hidden Markov process. What is a hidden Markov process? It is a process which takes place as a sequence through states, and there is some transition probability that governs the jumps from one state to another. So this is the causal diagram for a Markov chain: you jump from a state s at time t to a state s at time t+1 according to a certain probability distribution. It is a graphical description of what a Markov chain is. Now suppose that every time the system is in a certain state you do not observe the state itself, but the system emits some observation, which we call y_t, and so on at every step. At this level there is no action: it is something that just takes place; you do not want to control it for the moment, you just want to look at it. And the only thing you can see is this layer of observations. The states are totally hidden to you; you just see a stream of data coming out of the system, the observations. That is a typical situation: there is a system, and there are some observables that you can measure which come out of it. How can you obtain information about the state of the system from the y_t's? What do you need to know? You need to know two things. Suppose you have made a string of observations up to a certain time, so you have a long record y_0, y_1, up to y_t, and you ask: after these t steps, where do I think my system is? You would like to have a model to infer this, and that is exactly the setup we are talking about: we have a model for the transitions and we also have a model for the observations. If we have that, we can infer the state of our system to a certain degree of accuracy, we can follow our system; that is a question that will come up in due time, but I just wanted to give the insight. In general, the observation is a measurement: for instance, it might be a Gaussian error on top of the state, so you get some values according to some probability distribution. In general it is not deterministic.
If it were deterministic and one-to-one, then it would just be a relabeling of the states, and that would be the same case as before. If the mapping is not one-to-one, it becomes non-trivial. But we will go for the full thing, where there is stochasticity, errors, and so on. So this is a hidden Markov model, and it is an interesting theoretical tool in its own right, because you assume some transition probabilities and then, from the observations and the model of the observations, you want to infer what the state of the system is. That is an interesting and important problem in itself. But we want to do something else: we want to control the system. With a knob from outside, we want to move our particle somewhere, even though we don't know exactly where it is. We want to keep the cart with the pole in a standing position, even if we don't know everything about the system, only a small set of observables. In order to do so, we will have to enrich this diagram a bit. First of all, on what do we want to base our decisions? We only have access to this sequence of data. So the question is: does it make sense to decide what to do, how to act on the system, starting from this? What I have in mind is the following: from my current observation I take some action, and this, together with the current state, determines the new state, the combination of the two being ruled by P. That is one possibility: at every step I take one action based on my last observation and then I move on. Is this wise? Is this the best I can do? You don't see? Does it sound like a good suggestion to you? No, you're already lost in your thoughts. The suggestion from the audience was: perhaps you have to keep some, or all, of the history. That sounds like a good idea. This particular situation has a name: it is called a reactive strategy, that is, you react to the instantaneous measurement. Notice that if the instantaneous measurement were the state itself, that would be fine, because the state encodes all the previous history, the problem is Markov, and that is the case we discussed before. But if the observation is partial, it is just one observation out of many in the past, and you don't want to throw the past away. So we are going to modify the diagram in the following way: we want to keep the history. In order to keep the history — whether in full or in part we will discuss later — where do we put all these observations? A notebook? Whatever. A record? Where do you put your previous observations? Memory. So there must be another set, another space, in which we put our observations. And then, once we have some memory, we look at the memory and decide what to do, based on something which contains all the previous information. Typically, the memory at time t+1 will depend on the memory at the previous time — this is the process by which you remember or forget — and then you have to add the new information. This is the way you update your memory, and it will be specified by some function g, which acts here. An observation is something that you make at a certain point in time.
So right now, from here, I look at the room and I have an image of the room. But ten seconds ago I was over there, looking at the room from a very different angle, so I had a different observation; and I can remember both of them. This is how you put your information into the memory. You're asking about this arrow, sorry — this arrow shouldn't be here. Thank you, I was putting an arrow in the wrong place; thank you very much for correcting. Was that the trouble? Thank you. So in this particular case, every time a step advances — suppose I'm not changing state, I'm staying here — there will be another observation at the next time, which is typically understood to be independent of the previous one. It is drawn according to the same statistical law, f, but it is a different extraction, so the noise changes from time to time: I see the same thing but with a different error, for instance. You see what I mean? Your question is whether, if the environment changes very much, memory can still help? Can I only measure? Yes, but you measure with some error. If you make ten repeated measurements of the temperature and the temperature stays the same, you will reduce your error if you keep memory. Okay? So at this point, what we would like to do is to decide on the basis of this memory. At every stage we have our memory; I haven't specified yet whether it is a finite set of bits or an infinitely large memory. It is a support on which you put your observations, collect them, and manipulate them somehow. The function g expresses the way you put these things into the memory and how they change over time because of forgetting. It is a sort of processing: whether you just add something to the memory or you add it with some forgetting, both the erasing part and the part of adding new information are contained in the same g. It is one function which accounts for everything — recording and deleting — everything that manages the transfer of the memory from the previous step to the new step and the addition of new information. I will write down precisely what that is in a second; let's keep it a bit informal and qualitative at this level. What we want to introduce now are actions, and these actions will have to emanate from our memory. Remember, in the Markov decision process your decisions were based on states, but you do not have access to states now, so you have to base them on your memory, and that will be our policy here. Then the combination of the state and the action, through P, generates the new state, and the process goes on. Additionally, you also want to add the action to your memory, because what you did is important too: it is not only what you observe from the environment but also what you did. So this memory keeps a record of observations and actions; it adds them to memory, perhaps it compresses them. Suppose that you have a memory which is a stack of, say, length 10, and you store, for instance, the observation at time zero, the action at time zero, observation, action, and so on; then when you get to step number five your memory is full. So you might decide simply to remove the most remote measurement, the one from the beginning, to leave space for the new record, and so on and so forth.
So I erase the oldest memories and put in the newest one. This is one way of understanding what the memory does, and this procedure of removing things and adding things on top of the stack is the g function in this case. The noise is inside the observation, okay? So the idea is the following. Take the cart-pole again; that is our running example. Suppose you have access to all the coordinates, but with errors, with Gaussian errors on top. The state of the system is specified by the position and linear momentum of the cart and the angle and angular momentum of the pole, but you observe each of these with an independent Gaussian error on top. So the y's are these variables with errors, and the action is the force that you apply. At every step you say: I apply the force, I make a measurement with its errors, and I store it in memory, on and on. Why doesn't the action...? The action affects the state of the system; it acts on how the system evolves. If you apply a force, it drives the cart, and then the observation follows from the next state you obtain. Yes, it does: the action affects the environment through P, and then the environment gives the observation. That is true if this object is a Markov process that you can control; but now we are going beyond that situation, and we have to enlarge our space and include the memory in order to make it Markov again. This new process over the enlarged space, which includes observations, actions and memory, will again be a Markov process; it is just no longer a Markov process in states and actions alone. Whether a process is Markov or not depends on whether you are encompassing the full space: if you cut out some degrees of freedom, your system will not be Markovian any longer. So this is the situation: the full thing is Markovian, but if you cut it down to just states and actions it is not Markovian any longer, because there is a memory. It is totally consistent. Okay, so let's put these things on firmer ground so that they become clear. What are the functions we are talking about? There is the probability p(s' | s, a) of going from state s to s' given an action; that is the same as before — can you hold your question for ten seconds? — so this is the model of the environment, which we have already met. Then there is the model of the observation, f(y' | s', a). Here there is an additional arrow for f, because the action might in general affect the observation: depending on what you do to your system, it might give a different observation. If my action is to turn my back to the audience, my observation will be affected by that, independently of what the state of the system is. This allows for more generality. Then we have a function g(m' | m, y', a) — was that me? — which is the memory update: given the previous state of the memory, a new observation and the action that has been taken, there is a transition probability which puts all this information into the new memory state. It might be deterministic or random itself; if it is probabilistic it encompasses the deterministic case, and the randomness would just be errors in writing to memory. And then we have our policy π(a | m), by which we decide what action to take based on the memory.
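As a minimal sketch of this generative loop — assuming small hypothetical discrete spaces, arbitrary model arrays for p and f, a sliding-window g of length 4, and a placeholder random policy, none of which come from the lecture — it could look like this in Python:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite spaces (sizes chosen arbitrarily for illustration).
n_states, n_obs, n_actions = 3, 3, 2
MEM_LEN = 4  # sliding-window memory of the last few (observation, action) pairs

# Model components, all assumed known to the agent:
# P[a, s, s']  : transition model p(s' | s, a)
# F[a, s', y'] : observation model f(y' | s', a)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
F = rng.dirichlet(np.ones(n_obs), size=(n_actions, n_states))

def g_update(memory, y, a):
    """Deterministic memory update g: append (y, a) and drop the oldest entry."""
    return (memory + [(y, a)])[-MEM_LEN:]

def policy(memory):
    """Placeholder policy pi(a | m); here it just acts uniformly at random."""
    return rng.integers(n_actions)

# One rollout of the enlarged process (state, observation, memory, action).
s, memory = rng.integers(n_states), []
a = policy(memory)
for t in range(10):
    s_next = rng.choice(n_states, p=P[a, s])    # environment: p(s' | s, a)
    y_next = rng.choice(n_obs, p=F[a, s_next])  # observation: f(y' | s', a)
    memory = g_update(memory, y_next, a)        # memory update g
    a = policy(memory)                          # next action depends on memory only
    s = s_next
```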
Now, the process which corresponds to this complicated diagram is actually a Markov process. Why is that? Think about the following transition probability: how do you go from a quadruplet of state, observation, memory and action at a given time to the quadruplet at the following time? If we are able to write this as a transition probability, then the system is Markov in this larger space. How does it work? First of all, we know the state and the action taken at the previous time; from these, the transition probability p(s' | s, a) gives the following state. Then, given the new state and the action, the observation follows: f(y' | s', a). Then, given the new observation and the action, we put them into the new memory using g. And finally, from the new memory, we generate the new action according to the policy. You see that the diagram I drew is perfectly causal: if I start from my initial pair of state and action, I generate the new state; then the new observation is generated only from things which have already been determined, and so on and so on. This is also the way in which you would implement this process on a computer: you start from a certain initial state, you choose an action according to π, you evolve the state, you produce an observation, you record it in the memory, and you keep going. So this process, in the complete space, is Markov. Now, if you try to marginalize over some of these variables, you run into trouble. Suppose you want to forget about the memory entirely and trace it out of the Kolmogorov equation of the process: you will not be able to write down a closed expression in terms of, say, the probability over states alone. Why? Because the action depends on the memory, and if you wipe the memory out, the system is not Markov any longer. And last thing, of course, we have the rewards, which as usual are issued depending on the triplet of state, action and next state. So the question is always the same: can I find the optimal policy π which maximizes my cumulative discounted sum of rewards? Is the setting clear now? The goal of the program is always the same. There is a value function, which is the expected sum of all future rewards discounted in time according to a discount factor γ — the usual definition, the expectation of the sum from t = 0 to infinity of γ^t times the reward r(s_t, a_t, s_{t+1}). Now this expectation has to be maximized over π under the full process described by that transition matrix. You cannot decide on the basis of the state: your policy does not depend on the state, you never see the state, you don't know it. You have to decide on the basis of the memory, that is, the observations encoded in the memory. Yes? So the idea is simply that the new observation is determined by the action you took and the state in which you landed — a combination of the two — just to allow for the possibility that some actions modify the way you observe your system. f is the observation model, the dependence of the observation on the state and on the action.
If you removed this arrow, it would mean that no matter what action you take, you make the same observation. But that is a needless restriction; there is no reason to restrict ourselves to that particular case: some observations will depend on the action you take. Consider a system where one of the actions is "do nothing but observe the system" and another is "do things and don't observe". Just that would require two different arrows here, because two different actions have different effects on what you observe — observing, or closing your eyes. You want to allow for this generality. The reward, yes, is something which depends on the state, which again is something you do not observe directly; the percept is just the reward, and you do not know how to attribute it to a state — states are unknown, hidden. Now, if observations are perfect, what does that mean? It means that the function f, rather than giving a generic observation, returns exactly s. Then you put the state s_t into the memory; but that makes all the previous memory totally superfluous, because if you know the state you don't need it — the system is Markov in the state itself. So in that particular case, where the observation is perfect and f is a deterministic function which gives exactly the state, all this part below is short-circuited and you revert to the usual Markov decision process we discussed yesterday. So this really is an extension of the notion of a Markov decision process to the case where the decision cannot be based on knowledge of the state itself but has to rely on a memory support. Any other questions? In general, yes, we are losing information: the passage from state to observation is typically associated with information loss; otherwise, if there were no loss of information, that would be perfect observation by definition. Okay. So what are we going to do now? This is the policy and these are the rewards. We are going to follow quite closely the steps we took yesterday in optimizing the value function, but now in this different setting. First, I am going to erase most of the blackboard and just keep that transition probability over there for the time being — actually, let me rewrite the transition probability up here, since it is what we use in what follows. Yes, that index is just because it is the previous step rather than the following one; whether we write t or t-1 in the definition does not really matter. Okay, so let's start by writing down the value function, the thing we want to optimize. It is the sum over all future times, with discount factor γ^t, of the expected reward at each step. What does that depend on? It will be r(s, a, s'): this is the triplet which determines the expected reward. If you know s, then the action a follows from the policy, and if you know both s and a, then s' follows with probability p(s' | s, a). So now we just have to specify the probability of having, at a certain time t, a given memory together with a given state, and sum over s. Let's go through it again: if I know that at a certain time I am in state s with memory m,
then I pick an action a with this policy, and this action, in combination with the state s, generates a new state s', and I collect the reward. So this is a closed expression for the value. Notice that it only requires knowing, at every time, the probability of being in a certain state with a given memory. Why didn't we write the observation in it? Because in this case the reward does not depend on the observation. I could extend it and put y' in there, and then I would also have to add the observation model, but I am not doing this because that is typically not the case; and, if you wish, it is just a marginalization anyway: f is something you know, so you could sum over y' and get a redefined reward. Bottom line: it is something you can do, and it does not change the substance of the problem. Now, this is also interesting because you can realize that the process involving just state and memory is Markov. Even though this is a subset of the full space, it is still Markov, and we can show it just by writing down a Kolmogorov equation for it. The probability that at time t+1 I am in state s' with memory m': what is it? Well, I have to sum over all previous states and memories; given those, the policy determines the action a, which in turn determines my new state through p(s' | s, a); then there is the observation model f(y' | s', a), and then g(m' | m, y', a). And all of this is summed over a and over y' — yes, sorry, not over s', because s' is the target — just over y', and that's it. So this object you can obtain from the Kolmogorov equation of the full process by marginalizing over the observations and over the action to be taken at the following time, and you get exactly this result. The quantity in brackets is actually a transition probability from a certain pair of state and memory to a new pair of state and memory, given an action a. And it is easy to check that this object, summed over s' and m', gives one: if you sum over m' first, the g factor goes away because it is a probability; then the sum over y' makes the f factor go away; and then the sum over s' makes the p factor go away. So it is a transition probability. Before proceeding further, let's look at this expression — is it clear what I am doing here? Starting from the previous diagram, I am showing that if you define the transition probability appropriately, you can compress the diagram into states and memory only: there is a compact description of the system which accounts only for states and memory. And all these pieces are parts of the model: we know p, we know f, we know g — they are in the hands of the environment, but we know them. Now let's compare this with the Markov decision process case. In the Markov decision process, you will remember, the Kolmogorov equation is just a sum over s of the policy,
the action that you take, and then the probability p(s' | s, a). Can you see the similarity? If you just extend your space from the environment alone to environment times memory, the structure is the same: you replace s by the pair (s, m), the policy depends on the state of the environment and the state of your memory, s' becomes (s', m'), and this capital T becomes the model by which environment and memory change under an action. Of course it depends on all the things that sit in the belly of this capital T — the observation model, the memory update — but formally they are the same. This is interesting, because it would point to the conclusion that we can treat this problem with partial observations in the same way as we did before, and transform our problem of optimizing decision making in an environment which is not perfectly observable into just a problem of computation. Any objection? Can it possibly work? Let me repeat: given the similarity, one would be led to conclude that this problem with memory, with partial observability, can be mapped into a Markov decision process exactly like the previous one, only that now we work in another space, the space of environmental states and memory states, because the structure is the same. For an MDP I had a very similar expression: I just had states and no memory, but otherwise the same thing. Exactly. But that sounds like a fishy conclusion, right? It does not sound quite right. So, now that you have all spotted the similarity — let me state it once more: you just have a larger space and a more complicated transition matrix, and the conclusion would be that this becomes just a Markov decision process in the enlarged space. Fine, but it does not sound quite right. Because where is the learning part? Where do I get information about the environment? If my model of the observations is really poor and I get very noisy measurements, how can it be that this is mapped into just a calculation problem? I have to put the experience somewhere. So, now that you have seen the similarity, I am asking you another question: can you spot the difference? There is a subtle difference between the two when I do the extension — it is in the policy. What is the difference in the policy? Exactly: here the policy depends on the memory only. If we let it depend on both s and m, the mapping would hold, but it would also be trivial, because if you have the state you don't need the memory. That subtle difference — that π depends only on the memory — breaks the argument I was trying to make. So there will be differences popping up, and it will require more than just mapping this problem into an MDP in an abstract space. Eventually we will be able to recast the problem into the language of Markov decision processes, but it will take more than this. Nonetheless, it is true that if observation is perfect you know the state, and then it is just the ordinary MDP: like I said before, this is an extension of an MDP, and you recover the original problem when you have perfect observation.
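Written out — as a hedged reconstruction of the blackboard formulas described above, with the grouping into T being one natural reading — the Kolmogorov equation over the pair (state, memory) and its MDP counterpart are:

$$T(s', m' \mid s, m, a) \;=\; \sum_{y'} p(s' \mid s, a)\, f(y' \mid s', a)\, g(m' \mid m, y', a), \qquad \sum_{s', m'} T(s', m' \mid s, m, a) = 1,$$

$$P_{t+1}(s', m') \;=\; \sum_{s, m}\sum_{a} \pi(a \mid m)\, T(s', m' \mid s, m, a)\, P_t(s, m), \qquad \text{versus the MDP case} \quad P_{t+1}(s') \;=\; \sum_{s, a} \pi(a \mid s)\, p(s' \mid s, a)\, P_t(s),$$

the structural difference being only that π conditions on m rather than on the full pair (s, m).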
Okay, in order to see where things go wrong — and they have to go wrong somewhere, as we said — let's try the usual trick that we used before; it goes pretty much along the same lines. We introduce a function η, the expected time spent in a pair of state and memory, so that the value can be written as a sum over s, a and m of η(s, m) times the policy times the expected reward — the sum over t goes away. As we derived the equation for η(s) in the MDP case, we can derive an equation for η(s, m) starting from the Kolmogorov equation, and it looks pretty much like the one we derived before, except that, once more, π depends on the memory only and not on the state. I am not repeating all the calculations, because it is the usual trick: we impose the dynamics of η as a constraint, which introduces a multiplier φ depending on both state and memory; we add a convexity-enforcing term in the form of an entropy of the policy; and we write the Lagrange function, of which I will just give you the stationarity conditions — we won't go through the calculation in full. The Lagrange function contains a sum over s' and m' with the Lagrange multiplier enforcing the condition on η, plus the normalization condition on the policies and the regularization term, pretty much along the lines of what we already did. So let's write down the relevant equations at the stationary point of this Lagrange function. I will skip the details; the derivatives are straightforward. The whole thing is linear in the policy, and the only nonlinearity comes from the entropy, which again gives log π plus one when you take the derivative. So the first variation with respect to the policy — which depends on the memory only — gives us a sum over the states s, and then there is the second variation, which we write for completeness; and we set both of them to zero in order to find the maximum. Once more, the situation looks very similar to what we had before, but with the small difference that here there is no s in the policy. So when we perform the sum over s, it goes through this term and produces a new object, the sum over all states of η(s, m); whereas on the other side I just have this as a factor. When you write down the equation and solve for the optimal policy — and this is just the result of the calculation, nothing fancy — a new object appears, which was not there in the previous case: this ratio. Let me erase all of this and define η(s | m) as η(s, m) divided by the sum over s of η(s, m), and the quality function of a state, a memory and an action as a sum over s' of p(s' | s, a) times the rest. So the first thing you realize is that there is a new object appearing here with respect to the previous case, namely this ratio. What is it? The numerator is the expected time you spend in a certain configuration of memory and state, and the denominator is the expected time you spend in that memory configuration.
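In symbols — with the soft-max form of the policy and the full expression of Q written by analogy with yesterday's softened MDP result, since the transcript only spells out the first factor, so take this as a hedged reconstruction —

$$\eta(s \mid m) \;=\; \frac{\eta(s, m)}{\sum_{s} \eta(s, m)}, \qquad \pi_{\epsilon}(a \mid m) \;\propto\; \exp\!\Big(\frac{1}{\epsilon} \sum_{s} \eta(s \mid m)\, Q(s, m, a)\Big),$$

with, plausibly, $Q(s, m, a) = \sum_{s', m'} T(s', m' \mid s, m, a)\,\big[\, r(s, a, s') + \gamma\, \varphi(s', m') \,\big]$, and the greedy policy recovered in the limit $\epsilon \to 0$.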
This object has a natural interpretation as a probability, because it is normalized, and we can actually think of it as the probability of being in state s given the memory m. It is natural that an object like this appears: the agent does not know what the state is, it only knows its memory state, and this quantity encodes how the agent infers knowledge about the environment. It is not surprising that it shows up here. Recall that if the observations were perfect, this object would be a delta function, and everything would reduce again to the usual expressions that we derived yesterday for the Markov decision process. But now it is different: there is a new thing popping out and you cannot get rid of it. Of course it is derived from η(s, m). Then there is the second equation, where you can do the usual work of plugging things back in and derive a closed equation for φ, which is the value of your combination of state and memory, or for Q, which is the quality of your state, memory and action. But this equation will again have η inside. This is in stark contrast with the Markov decision process, where there was nothing like that. And it is also interesting: if η sits inside your equation, then the equation also depends on ρ₀ — it depends on the initial state of the memory. Does that sound familiar? What is the initial state of your memory? Nothing; "nothing" is one particular memory state. The first observation is the one after which you update, but there is an initial condition for the memory. So — and I wanted this to come from you — this object encodes the probability that the environment is in a given state s given the memory; inside it there is the notion of an initial memory; and all of this looks very much like an inference problem. It is all very akin to inference, and in a second we will show that it is actually the same thing that happens in inference: you have a prior, which will be your initial distribution over the memory, and you have a posterior — given a certain memory, what probability do I allocate to a given state of the environment? So inference comes about naturally in this description of the process. It is still a little bit obscure, because we have been working with a generic memory — our memory could have been just one bit, which is not the usual way one deals with memory in inference — but in a second the connection will become clear. Still, you end up with a couple of equations which are closed. It is not just one equation as in the MDP, where you had the Bellman equation for the value function; now you have two sets of equations, one for the value function and one for η, and the two are coupled together. But you can use the same kind of iterative techniques to solve the problem. So in a sense it is true: we have been able to map our decision-making problem into a calculation. But there is a price to pay, and the price is the appearance of a new quantity which encodes the knowledge the agent has about the state given the memory.
If before we had to solve the Bellman equation for the value, and the solution was one point in a real space whose dimension is the number of states, now we have to solve a problem in a space whose dimension is the number of states times the size of the memory — which, if you want to keep a long memory, is going to be much, much bigger, and if you want to keep everything in memory it is going to be infinite-dimensional. Clearly, keeping all the memory is the right thing to do, because you can always discard information afterwards. Of course there might be limitations for which your memory is bounded, so you may also want to investigate the effect of the size of the memory: it will affect the decision process. These are things that are known even in the psychology literature: the size of the memory you work with affects your decisions, and you decide differently from what would be expected on the basis of full information — that is known. And this is a way of calculating it. It is very cumbersome, and it may or may not make sense depending on the kind of problem you have at hand, but at least it provides a very solid framework for the whole discussion. Questions? Yes, that holds in this case as well, and you can see it simply from the fact that when you remove ε the functional is linear in π, as before. Now, to further clarify the interpretation of this η and how it enters — sorry, is this φ that φ? What are you referring to? Ah, this one. You are right, I must have missed something here: there should be a π in it; what I wrote in the notes is not correct. Let me figure that out. Oh, sorry, I see: this object here should of course be the T function, T(s', m' | s, m, a), together with the sum over s' and m'. Apologies — I had written it incorrectly; here we should have the transition probability of the new Markov process. Thank you for spotting this; I mixed up the notations for the MDP and for this process. I will give you corrected notes. So, let's discuss this quantity η in the particular case that we are interested in, which is the case of perfect memory. Consider this object, the probability of being at time t in a certain state s with a certain memory m, and rewrite it in the following form: the probability that the memory is in state m, times the conditional probability that, given the memory m, the environment is in state s. This second factor naturally encodes the idea of the belief that the agent has of being in the environmental state s given the memory state m.
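In formulas, this factorization reads

$$P_t(s, m) \;=\; P_t(m)\; b_t(s \mid m),$$

where $b_t(s \mid m)$ is the belief: the probability the agent assigns to the environment being in state s, given that its memory is in state m at time t.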
Now, from the Kolmogorov equation for this quantity we can derive an equation for the evolution of the belief b; this is still true in general. At time t+1, my belief of being in state s' given memory m' is, by definition, P_{t+1}(s', m') divided by P_{t+1}(m'); and if I substitute the Kolmogorov equation into the numerator, I get a sum over m of another object, involving the sum over s, a and y' of the models, divided by the corresponding normalization. This is a triviality so far — just rewriting things according to the definitions. Now comes the crucial step: we assume that the memory is perfect. What does perfect memory mean? It means that as we receive observations y and take actions a, we just keep them in memory; we do not process them. This is what in the old days was called using the memory as a tape: a record to which you keep adding things, an infinite stack. At the beginning your memory is a long stack of empty records. After the first step, the memory at time one is just the action taken at time zero and the observation y₁, followed by all empty items — void, void, void, down the record. At time two the memory is a₀, y₁, a₁, y₂ and again void entries. Every time an action is taken and an observation comes back, you simply copy them into the memory. That is what a perfect memory device would be. Of course the memory space is now infinite-dimensional, because it has to accommodate all possible incoming records. In this case the g function is deterministic: it just appends items to your memory; you do not forget and you do not make mistakes. This means that g(m' | m, y', a) is zero unless m' is exactly the previous memory with the new observation and action appended, which is what I am writing here explicitly. And in this case, after t steps, the probability distribution over all possible memories is just an indicator function — one if what is inside the brackets is true, zero otherwise — of m being exactly the sequence a₀, y₁, ..., a_{t-1}, y_t followed by voids. Everything is deterministic: at every step the memory sits at exactly one point of this huge, high-dimensional space of sequences. So, first of all, this object is the indicator function: 1[F] equals one if F is true and zero otherwise; that is one way of denoting it. What happens to the iteration above when the memory is perfect? The sum over m collapses: it picks out the single memory state consistent with the record, so that sum goes away, and you end up with the following formula, which I am writing up here. Your belief at time t+1 of being in a state s', conditioned on the memory at time t+1 — and the memory at time t+1 is just a₀, y₁, ..., a_t, y_{t+1} followed by voids — is the ratio whose numerator is the sum over s of p(s' | s, a_t) times f(y_{t+1} | s', a_t) times the belief at time t of being in s given the previous history, and whose denominator is the corresponding normalization. So in the case of perfect memory, that is how the beliefs evolve.
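As a concrete sketch, this perfect-memory recursion is a one-step Bayes filter; the arrays below are hypothetical stand-ins for the known models p(s' | s, a) and f(y' | s', a):

```python
import numpy as np

def belief_update(b, a, y_next, P, F):
    """One step of the perfect-memory belief recursion:
    b'(s') ∝ f(y_{t+1} | s', a_t) * sum_s p(s' | s, a_t) * b(s).
    b : current belief over states, shape (n_states,)
    P : transition model, P[a, s, s'] = p(s' | s, a)
    F : observation model, F[a, s', y'] = f(y' | s', a)
    """
    predicted = P[a].T @ b                        # transport of the belief by the action
    unnormalized = F[a, :, y_next] * predicted    # reweighting by the likelihood
    return unnormalized / unnormalized.sum()

# Tiny hypothetical example: two states, two observations, one action.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]]])        # P[0, s, s']
F = np.array([[[0.7, 0.3],
               [0.1, 0.9]]])        # F[0, s', y']
b = np.array([0.5, 0.5])            # uniform prior
b = belief_update(b, a=0, y_next=1, P=P, F=F)
```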
So this is the belief of being in state s' given the whole history of observations, which I have perfectly stored in my memory, and it is expressed in terms of the belief of being in state s at the previous time. What is this formula? Pardon? It is Bayes' theorem. This is Bayesian inference. So when the memory is infinite and you lose nothing, the way memory is naturally encoded in this system is through the belief. Yes? Where is the π? There is no π here, because the action is already fixed: everything is conditioned on the actions up to this time, which are already recorded in the memory. So a_t is not a random quantity here, it is fixed by the previous history — the memory includes the past actions, so the policy does not appear. It is a matter of how you order events: you could write other expressions in which the policy would appear, but in this form you don't need it, and it is more transparent. Okay. So this is the way your belief evolves, and there are two contributions to it. This is the way you map priors into posteriors given new experience. The factor f(y_{t+1} | s', a_t) plays the role of the likelihood of the new observation — how likely your new observation is — which is the usual ingredient of Bayesian inference. But here there is an additional term, because your action may change the state of the system, and if an action changes the state, your belief must be transported accordingly. An intuitive way of understanding this: suppose I am sitting here with my eyes closed, and I have a certain belief about the position of Matteo, who is at the back of the hall — I think he is there, with some error. Then I take a step. How has my belief changed? It has simply been shifted, transported. That is what this term does: it transports the belief according to the laws by which my actions change the state. So one piece is the transport of the belief — which is in general stochastic, because the environment is stochastic — the other piece is the way the likelihood modifies the belief, and together they give the overall update. To finish this part, we just have to take the last step. You remember that in the previous calculation we had to deal with the quantity η(s | m). What does it become in this case of perfect memory? Well, it takes two lines of calculation to show that this object is exactly the belief. This is because, with perfect memory, you visit each point of the memory space at most once: you cannot go back in memory if you never delete; your memory is a stack which keeps growing, so at every time there is a single record you can possibly be holding. As for f: suppose for a moment that your system does not change state — just like in the bandit problem, there is only one state, so the transition probability is always one no matter what you do, and the belief is then about some other property, of course. In that case the update just multiplies the belief by the likelihood of the observation given that state — s' equal to s, given the action — and renormalizes; that is the usual Bayesian part.
So, as usual in Bayes, you start with a prior, and this prior gets pumped up where your observations are more likely, depressed where they are less likely, and then renormalized. Here, in addition to this pumping up of the prior, there is also a shifting, or diffusing, in the space of states. Okay, so it has been a long path, and it only brings us to the definition of what is the currently accepted notion of a partially observable Markov decision process. All of this was motivation: I was trying to show you, starting from a Markov decision process where you have full observability, how the notions of inference and beliefs arise. This is an attempt to condense into a much shorter time a process of thinking that took decades. Now I will give you the distilled definition of a partially observable Markov decision process, and I hope it will sound natural after this discussion. Behind the use of the belief as the fundamental object there is also an important result in statistics, which says that if you have a string of observations and you assume a prior distribution for the state of your system, then the posterior computed according to Bayes' formula is a sufficient statistic for the full history. This is another way of saying that you can compress all the knowledge about your past history exactly into a single object, which is the belief I just defined. This is the same content as what I said before, but in more formal terms. So beliefs may be a useful way of compressing the whole history rather than keeping a long record: you use an object which lives in a much larger space, because it is a continuous probability distribution rather than a string, but it is perhaps more useful and compact. That is the kind of reasoning which led to the definition of the partially observable Markov decision process. What is it? You have states, you have actions, you have observations; but now, rather than dealing with memory in the general sense of a support on which you record things and preserve information, the POMDP works directly with beliefs. A belief is a probability distribution over the state space: a function b from S to R such that the sum over all states of b(s) is 1 and b(s) ≥ 0. It encodes the agent's expectation about what the current state of the environment is. We have to specify a couple more ingredients to complete the description of a partially observable Markov decision process. As before, there are the quantities which completely define the process: the model of the environment p(s' | s, a), the model of the observations f(y' | s', a), and the rewards, together with the states, the actions, the observations, and the belief, which now replaces the memory as the container of full memory. We have to define the dynamics. Before, there was the memory and we had to define the function g which gave its evolution; now we have to replace it with an evolution for the belief, and this evolution is just Bayes.
Okay, so the new belief of being in state s' is just like I wrote: it is the same formula as before, written more compactly — the sum over s of p(s' | s, a) times the observation model f(y' | s', a) times the current belief, normalized. This is the belief conditioned on the observation y' and the action a. So this is the Bayes operator: a map which sends the old belief, which is a vector, together with an observation and an action, to the new belief. We can write it compactly as b' = B(b; y', a). This is the dynamics in belief space, which corresponds to perfect memory; it is totally equivalent to keeping the whole string. The last thing is to define the policy. The policy was a function of the memory, and now, naturally, the policy is a function of the belief: π(a | b), a probability distribution over actions which depends on a continuous variable, because beliefs are probability distributions, properly normalized — thank you. Okay. So this is the policy within this definition. Actually, you can prove — perfect memory being encoded in the belief — that there is a Bellman equation for POMDPs, which is just like the one we have been discussing. There is a value for a given belief: I have a belief, a probability distribution in my mind of how the world is, and depending on that probability distribution I have to make decisions; if I am in that particular state of mind, there is a value attached to it, which is the best I can get out of the process given that belief. And this object obeys a Bellman equation very similar to the one we derived for the MDP. You will recognize exactly the same structure as the Bellman equation after taking ε to zero, if you just think that instead of a state you now have a belief over states: the value of your current belief is the maximum over actions of the expected reward plus the discounted value, evaluated not at a state s' as before, but at the belief b', which is given by Bayes' rule. So this is nice: after a very long detour, we have been able to map our decision-making problem with incomplete information into a Markov decision process, at the price of working in the space of beliefs, which corresponds to perfect memory.
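Spelled out — again as a hedged reconstruction of the blackboard formula, in the standard POMDP notation — the Bellman equation in belief space reads

$$V(b) \;=\; \max_{a}\Big[\; \sum_{s, s'} b(s)\, p(s' \mid s, a)\, r(s, a, s') \;+\; \gamma \sum_{y'} \Pr(y' \mid b, a)\; V\big(B(b;\, y', a)\big) \Big],$$

$$\Pr(y' \mid b, a) \;=\; \sum_{s, s'} b(s)\, p(s' \mid s, a)\, f(y' \mid s', a).$$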
That is the good news. If you are theoretically inclined, the bad news is that solving this equation is, if anything, more complicated than solving the previous one, because it lives on a continuous space. You could try things like discretizing your belief space, which lets you work with a finite-dimensional approximation of beliefs; but, as you can imagine, it is not going to be very good, because if you get very good data your beliefs will concentrate around particular states, and eventually you lose resolution in your discretization. So you have to use something like adaptive grids, if that means anything to you; and people have been developing, and are still developing, new methods to tackle these problems, because there is a large variety of problems you can address within this framework, even though it requires a lot of knowledge: at every stage you need to know the transition model and the observation model, and you have to make the assumption of perfect memory, which is also very strong. So it is tough, but it is a real thing. Now, as an exercise, I will sketch the solution of one rather simple decision-making problem, which we will present both in its MDP form, where we have access to the states, and in a situation where we have only partial observations. We will solve the problem in both cases and see the differences. It is a very simple example and we will close with that. Did you receive the email from the secretary about the reference material? Yes? No? Okay, please check with the secretary. One reference is the book about reinforcement learning in general; the other is a review specifically on partially observable Markov decision processes, where you can find other examples and a more thorough description — basically everything that goes from this definition onward. If you didn't receive it, ask the secretary; I sent it to be forwarded to all of you. All right, so the very simple example is the following. There are just two states, called 1 and 2, and two possible actions. From state 1 you can take action a2, which brings you to state 2 with probability one and gives reward R; or you can take action a1, which sends you back to state 1 with probability one and makes you pay a price. Everything is deterministic here, to keep it extremely simple: if you pick a2, you go to the other state and get a reward; if you pick a1, you did the wrong thing, you are sent back to the original state and you pay a price for it. From state 2 the situation is symmetric, only now it is action a1 that sends you to state 1 with probability one and gives the reward, and action a2 that sends you back and makes you pay the price. So it is a very simple system: when you are in 1 you want to take action a2, but if you stick with a2 afterwards that is bad, because you will stay in place and pay a price; and vice versa, if you start from 2 the best thing to do is take action a1, but then you don't want to pick a1 again, because it would keep you in place. When you think about this system as an MDP, it is pretty obvious what the optimal policy is: if I am in state 1, I take action a2 with probability one; if I am in state 2, I take action a1 with probability one. If I do that, I flip back and forth and always collect the reward. That is the best I can do. If you want, you can write down the Bellman equation for this in state space, and it will tell you just this triviality, which you can check. Of course you could add randomness and the situation would become more interesting, but we won't do that. Yes, exactly: you don't want to stay in place. But the labeling is important: mind that the action which keeps you in place in state 1 is the same action that sends you from 2 to 1.
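A few lines of value iteration confirm the alternating policy. The numbers are hypothetical placeholders: reward +1 for the "correct" action, -1 for the "wrong" one (only the signs matter here), and some discount γ; states 1 and 2 are indexed 0 and 1, actions a1 and a2 are indexed 0 and 1.

```python
import numpy as np

gamma = 0.9
# Deterministic two-state MDP.
# From state 0: a2 (index 1) moves to state 1 with reward +1; a1 (index 0) keeps you
# in state 0 with reward -1.  From state 1 it is symmetric.
next_state = np.array([[0, 1],
                       [0, 1]])
reward = np.array([[-1.0,  1.0],
                   [ 1.0, -1.0]])

V = np.zeros(2)
for _ in range(200):                      # plain value iteration
    Q = reward + gamma * V[next_state]    # Q[s, a] = r(s, a) + gamma * V(s')
    V = Q.max(axis=1)

policy = Q.argmax(axis=1)                 # -> [1, 0]: take a2 in state 1, a1 in state 2
print(V, policy)
```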
Now we are going to turn this into a partially observable process. How? We are going to say that the agent does not know whether it is in state 1 or 2; at the beginning it has no idea, it could be anywhere. The states are there, but the observation is the worst observation you can possibly have: it carries no information. It is the extreme case of partial observability. Can you still decide what to do? Well, yes: even with no information about the state you can do something — otherwise this would be really tough. The only things the agent has access to are the history of its actions — the observations being void of significance — and, possibly, the rewards. Yes, exactly, that is precisely what happens in this particular case: even if you start with a total absence of knowledge, after a single step your belief collapses onto a delta function on the state, and from that point on it is effectively a Markov decision process. This is an extreme case — it never happens quite like that in practice — but still, that is the idea. Now, to your question: to the observer there is no difference between the two rewards; yes, they are numerically the same, but that doesn't matter, only the sign matters. So, at the beginning the belief is one half for each state: that is the belief at time zero. The question, which I leave you as an exercise, is to prove that the optimal strategy is exactly the one we discussed intuitively, and to find the value as a function of the belief. In practice, you have to solve the Bellman equation for this POMDP. Your system lives in a belief space over these two states, so it is just a line: there is a single independent variable, because the belief of being in one state is one minus the belief of being in the other. The value will then be a function on the unit interval which obeys the Bellman equation, and there are only two actions, so the Bellman equation is a relatively simple maximization problem which you should be able to solve. The result is exactly the intuitive one given by your colleague. Give it a try, and come back to me next Monday if it works or not. Next Monday it will be full learning, that is: what do you do when you don't have any model of the environment? Have a nice weekend.
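For those who prefer to check the exercise numerically rather than analytically, here is a rough sketch of value iteration on a discretized belief space, under the same hypothetical ±1 rewards as above; it is not the analytic solution the exercise asks for, just a way to compare against it. Since the observation carries no information and the transitions are deterministic, the belief after either action collapses to one of the two endpoints of the interval.

```python
import numpy as np

gamma, n_grid = 0.9, 101
bs = np.linspace(0.0, 1.0, n_grid)     # b = belief of being in state 1

# Expected immediate reward of each action under belief b (hypothetical rewards +-1):
# a2 gives +1 if the true state is 1 and -1 if it is 2; a1 is the opposite.
r_a2 = 2 * bs - 1
r_a1 = 1 - 2 * bs

# With uninformative observations and deterministic transitions, the belief after a2
# is "surely in state 2" (b' = 0) and after a1 it is "surely in state 1" (b' = 1).
V = np.zeros(n_grid)
for _ in range(300):
    V = np.maximum(r_a2 + gamma * V[0], r_a1 + gamma * V[-1])

# The numerical V(b) comes out piecewise linear in b, with a2 optimal for b > 1/2 and
# a1 optimal for b < 1/2 (either one at b = 1/2); compare with your analytic solution.
```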