So, what the agent does is take actions, and these actions have consequences on the environment: they change the state of the world. What does the environment do to the agent? Well, it sends signals to the agent. We can call these signals the states. Inside this arrow we are encompassing everything that is the perception of the external world: what the bee sees and feels and smells, what we know about the external world before we decide to bet 100 pounds on Brexit, for instance. That is all we know, but it's not the end of the story. There's another thing that goes from environment to agent, and it's crucial: the reward. This is a very crucial point. Decision making is about rewards; it's a hedonistic theory. The agent wants to maximize something. It has a goal, and this goal comes in the form of rewards, which flow from the environment to the agent. In particular, as we will see in a more formal way in a second, what it wants to maximize is some long-term measure of reward. So it's not content with getting the best out of the present moment; in general it may want to get something in the long run. That's the typical situation.

So let's formalize this a bit. I will be as sloppy as a physicist can be while borrowing notation from mathematics. The key idea goes under the name of MDP, Markov Decision Process. What are they? The ingredients are simple. There are states. The states belong to some space; just to fix ideas, you might simply think of a discrete set of states. So the world can be in, I don't know, ten states, and you number them from one to ten for simplicity. Of course this can be extended to continuous spaces, with all the difficulties that might ensue, but just to fix ideas it's a discrete set of states. Then there are actions. Again, you might think of these as discrete: I will do this or that, I have three options or ten options. Of course, which actions are available may depend on the state. The action "I will invest 100,000 euros in that particular bond" is feasible if and only if the world, that is, your bank account, is in a state which allows that action. So the set of actions might be wider or smaller depending on the state you are in. This is the structure on which the actual process, the real dynamical process, takes place, the process which is intuitively contained in this loop: the agent takes an action, this changes the environment, the agent senses the environment and receives a reward, and in turn acts again, et cetera. So how do you formalize this process? You do it with a Markov chain. You define, for each state, a probability distribution over the actions: if you are in a state S, there is a probability distribution over the possible actions you can take. This quantity is called the policy. It's a strategy, a way to map states into actions, and it is what we will want to make as good as possible depending on our specific interests. Then I have to model how, as a consequence of actions, the world changes. This is also described by a probability distribution, which is technically called the model of the environment: given that the world is in a state S and I choose action A, I will end up in a new state S prime of the world with a certain probability P. So there are two distributions at play here.
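In symbols, and this is just my reconstruction in standard textbook notation rather than the exact blackboard symbols, the two ingredients are:

```latex
\pi(a \mid s) = \Pr(A_t = a \mid S_t = s) \qquad \text{(the policy)},
\qquad
P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s,\, A_t = a) \qquad \text{(the model of the environment)}.
```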
Together, these two distributions form what is called a Markov chain in the joint space of states and actions. So, diagrammatically, in very compact form: there is a state; according to my policy I choose an action; and these two things combined, together with the reaction of the environment to my action, give rise to a new state S prime. Then I iterate this at the next step, and so on and so forth. It seems a rather innocuous assumption that we're making here, but in fact it's an extremely strong one: that this process behaves as a Markov chain. The simplest way to think about this Markovianity is as an extension of the notion of determinism to a probabilistic setting. What does that mean? It means that if at any given instant of time you know the probability distribution over your states, you are able to predict the probability distributions over all future states. That's the Markov property. It's a very strong property, and we will see later when and how it breaks down. In particular it means that in order to apply this kind of model to your decision making, you are assuming that the space of states is rich enough that every action, combined with the state, gives you an outcome whose probability you can predict. So this is a very strong assumption.

As I said before, another important ingredient is the reward. There is a reward which in general might depend on the state, the action you take, and the state in which you end up. This is the most general setting. These rewards might themselves be random: it might be that if you are in a given state, you take an action, you end up in a new state, and if you repeat this several times, the rewards vary from one time to another. That's possible, but in the following we will focus on the case where this quantity is the average of those rewards. Then what is the goal of this process? What do we want to do as an agent? We want to maximize the expected value, and I will tell you in a second what this is, of a sum over all future time steps, because we want to maximize something over the future, not over the past; that's the difficult part of it. The sum of all the rewards that we will get at all future time steps, and now I'm adding a factor here which is called the discount factor, ranging from zero to, say, slightly less than one. This discount factor is a measure of the horizon that you have in your future. If you don't expect to survive beyond tomorrow, you will set your gamma to a very small value, because you will want to optimize in the short term. If you have a very long life expectancy, then your gamma will approach one: you may want to accumulate these rewards over a very, very long horizon. This drastically changes the kind of strategies you might want to adopt, as you can easily imagine. Why save money if the world is going to end tomorrow, for instance? But if my expectation is that I will live on for 30 years, then I might want to save money. This quantity is called the return, a cumulative reward discounted over the future in this particular way, and what we maximize is its expected value. This seems, in principle, to be quite a difficult problem, because you have to optimize this quantity starting from somewhere while trying to predict whatever can happen in the future. Clearly, you can imagine, this gives rise to an enormous tree of possibilities.
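In the same notation, the objective just described would read (again a reconstruction, with r denoting the reward collected at each step and gamma the discount factor):

```latex
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \right],
\qquad 0 \le \gamma < 1 .
```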
In order to evaluate this thing directly, you would have to check all of them according to a certain policy, right? The expected value means that you adopt some policy pi, and you will have different outcomes of your process, which is stochastic by nature because these are probability distributions; you average over all these possible outcomes, and then you want to find the pi which does the best job. This is a maximization over policies, and the policy which does the best job is your optimal policy. This might seem a very complicated problem to deal with in principle, but a very nice thing happens here, and it results from the combination of two facts: the process is Markovian, and the quantity you want to optimize is a sum of things that happen over time. At every step you make a new step and add something to your cumulative reward. This sequential, temporal structure is the key to a result which is central to the theory of decision making: Bellman's equation. The idea behind it is very simple. It starts from the fact that for a given policy you can split this sum into what you get at the first step, the immediate reward, plus the sum of what you get afterwards, discounted by gamma. You can unroll this sum and write this quantity, which is defined as the value of a certain policy, in a recursive form. Specifically, suppose you start in an initial state which we call S; the value becomes a function of S. This is the value function, and its interpretation is that it is what you get out of your policy pi if the state from which you start is S. It's a function of the states. This value function obeys a very simple and elegant recursive equation, Bellman's equation, which says that the value of a state is a sum over all possible actions, weighted by pi of A given S, of what you collect along the way; I will write it and then comment on it in a second. What it is telling you is that what you will get from your policy starting from S is, first, what you get on average from the first step, because you will be choosing with probability pi an action A, which will send you with probability P into a state S prime and give you this reward on average. That is what comes from the first step; in fact, if you set gamma to zero, which is the one-step horizon, that's all you get, and you don't get the second part. The second part comes if you want to consider what will happen next. And what is that? It's just the fact that in this sum you have taken out the first term and factored out one gamma, and you end up with the same sum, but now starting from the new point. So you see the recursive nature of this expression. This is just an outcome of the Markov property and of the sequential nature of the goal that you set for your problem. It is also the basis for identifying the optimal strategy, because what you want to do now is take the maximum of this equation over all possible policies. This identifies a new function, the best you can get with your optimal policy starting from a state S. And if you do this operation of maximizing both sides, you get another result, which is called the Bellman optimality equation. It says that the optimal value of any state is the maximum, over all policies, of this same right-hand side. This is an equation for the optimal value.
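Written out in the same notation, the two equations just described take the standard form (my reconstruction, with r(s, a, s') the average reward defined earlier):

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[ r(s, a, s') + \gamma\, V^{\pi}(s') \right],
\qquad
V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left[ r(s, a, s') + \gamma\, V^{*}(s') \right].
```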
If we solve that, we also have the optimal policy, because what the optimal policy does is always try to go from one state to another looking for the largest value of this function. If I'm in a state S, I look at all the states I can reach with my actions, I take the action which brings me to the state with the largest V star, and so on and so forth. This recursive equation contains the solution to my problem. If I can cast my problem, and that's the point, if I can cast it as a Markov decision process, I know how to solve it. So I have mapped my complicated problem of planning, of making decisions, of looking far into the future, onto an algorithm. It has become a computational problem: solve this equation. And the good news is that you can solve it in a very straightforward way. The general idea goes under the name of bootstrapping. Do you know what bootstrapping is? Once upon a time, shoes used to come with straps here to pull them on, especially boots. I don't know whether it's still in use, but there used to be a saying in English about jumping over a fence by pulling on one's own bootstraps. It describes something which is clearly unfeasible: you should be able to pull on your bootstraps and use that force to jump over the fence. But that's exactly the principle at work here. You solve this equation by doing just that. You start from a guess; that's your bootstrap. You start from a guess for your function V, you put it into the right-hand side, and you maximize. This is something you can do: if I give you a V, you can maximize the right-hand side somehow. When you do that, taking the maximum, you define a new V. You plug that back in, and so on and so forth. The little miracle that happens here, and it takes a while to prove it, but you can prove it, is that this operation converges to the actual optimal value function. So this gives you a way to compute optimal actions, values of states, et cetera.

Yes, there are technical assumptions that you might make; I'm not sure, that one is perhaps too strong, but yes, there are some assumptions, of course. If your state space splits into disconnected blocks, of course, that wouldn't work. That is the overall set of states, yes, and actions, yes. What do you mean by its cardinality? Perhaps I'm missing the question. The set of policies: these are probability distributions, so they can take any value in a simplex. Yes, for each state S you have a simplex, so it's the cardinality of S copies of the simplex. Is pi a real number? Yes: for every state you might have A actions, and then there's a probability distribution over these A actions for each state, so it's a cardinality-of-S collection of probability distributions.

Okay, so now everything has been wrapped up in the form of an algorithm; solving the equation has become an algorithmic task. But Bellman himself quickly realized that, besides all the foundational problems that I will discuss in the following, there is one big computational issue with this problem itself. It comes up when you consider examples. For instance, think about chess. Do you want to implement on a computer a solution of this Bellman optimality equation for chess? For chess, the number of legally accessible states is of the order of 10 to the 47. A big, big vector. In fact, that's absolutely not what computer chess players do.
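To make the bootstrapping idea concrete, here is a minimal value-iteration sketch for a tiny, fully known MDP. The transition table, rewards, and discount below are invented for illustration and are not the lecturer's example; this is a sketch of the general scheme, not a definitive implementation.

```python
import numpy as np

# Toy MDP: 3 states, 2 actions (made-up numbers, purely illustrative).
# P[a][s, s'] = probability of landing in s' from s under action a.
P = np.array([
    [[0.9, 0.1, 0.0], [0.0, 0.8, 0.2], [0.1, 0.0, 0.9]],   # action 0
    [[0.2, 0.8, 0.0], [0.0, 0.1, 0.9], [0.5, 0.5, 0.0]],   # action 1
])
# R[a][s] = expected immediate reward for taking action a in state s.
R = np.array([
    [0.0, 0.0, 1.0],
    [0.1, 0.5, 0.0],
])
gamma = 0.95

# Bootstrapping: start from a guess for V, plug it into the right-hand side
# of the Bellman optimality equation, take the max over actions, and repeat
# until the values stop changing.
V = np.zeros(3)                       # initial guess
for _ in range(1000):
    Q = R + gamma * P @ V             # Q[a, s] = one-step lookahead value
    V_new = Q.max(axis=0)             # greedy over actions
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

policy = Q.argmax(axis=0)             # greedy policy w.r.t. the converged V
print("V* ~", V.round(3), " optimal action per state:", policy)
```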
Yes, it's a different game, but just to highlight the complexity of this, suppose you were to play in an abstract setting in which the list of actions is fixed by an opponent that doesn't react to your choices. Even then it would be totally impossible to do this directly. I will not get to multi-agent settings, because it's a long way even to get to the end of single agents; but sure, it's a perfectly good question to ask. So these kinds of approaches do not actually work for most real decision-making problems, let alone when the space of states itself is continuous, which it might be: if it's a robot that you want to operate, it might take states and actions in continuous space. So, first, what one would like is to reduce the dimensionality of the problem: map it onto a smaller space spanned by relevant basis vectors, which are technically called features, and solving the problem in this subspace, with a limited number of features, might lead to a solution. Another way is to do it nonlinearly, using neural networks to approximate these complicated functions and then using them to solve the problem. There has been a lot of work in past years trying to address these problems from the point of view of function approximation and generalization. Actually, the first really strong artificial player of this kind, Tesauro's backgammon program, was built by trying to solve the Bellman optimality equation with function approximation using neural networks. It was this combination of things that brought the first good artificial player to life.

Now, I will not dwell on this, because there is something more urgent on my agenda, which is essentially the fact that this is not a very good starting point for real decision making. There are at least two big issues here. The first is that it almost never happens that the environment gives us its state as a signal. By saying that the system is Markovian, we are in a sense assuming perfect knowledge and observability of the environment. But what we get, what animals get, is just cues: very limited information about the environment. That's the first problem, which I will address in the second part of the lecture. Then there is another problem: we also don't know what the model of the environment is. Knowing it would be like knowing all the laws of nature. If you knew the states and the laws, this would be Laplace's dream: given the initial conditions, you would be able to predict everything in the future. But we don't know what the rules are; we cannot even guess what these probabilities are. So how do we deal with that? There is a problem of limited observability, that is, what kind of signals we actually get from the environment, and there is a problem of predictability, that is, how accurate our model of the environment is, if there is one at all. Just for clarity, I will separate these two parts. I will first discuss the situation in which you don't get a precise picture of the environment, but you do have a model of it, so you know how it would react; you just don't know what the real S is. You get only a faint image, the shadows in Plato's cave, and you have to decide nonetheless, given the partial information that you have about the environment. In this case you don't know the states exactly, but you know how the system would react if you knew them.
Then a complementary situation, if you wish, is when you know the states but have no model of the environment. And eventually the great synthesis would be to build algorithms that are able to deal with both: partial observation and incomplete knowledge of the laws. So let's start over again, now assuming that we don't get from the environment all the information we might want. This will radically change the nature of our problem, because as we will see, it will not just be a problem of computing good actions; it will become a problem of learning, of learning how to act. If you have any questions about this first part, I'm happy to take them now. Yes, please. These were great success stories in the 50s. Then people realized that if you have a large system and you say, okay, I will just map my system onto one with a small number of states, you end up very, very far from really good solutions. And this is related to what I will discuss right now. So what's the source of this problem?

The next step, and I will reuse many of the things that are already on the blackboard, is to go to another level, which includes Markov decision processes but is broader. This new level goes under the name of partially observable Markov decision processes. What's the idea? The idea is that there is a Markov process underlying everything: the world has its own laws, which are here, and I know them, but I don't know exactly where I am, because the environment doesn't send me the state it is in. It just sends me some observation. It's as if you were checking your bank account to decide whether you want to buy that flight, and you want to check whether your savings account has enough money, and the bank replies: you might have 1,000 dollars with a certain probability, and perhaps 10,000 dollars with a small probability. You get some partial information, and you have to decide nonetheless. Can you do that optimally? That's the typical situation animals face. An insect tries to find a mate by smelling pheromones in the air. The location of the mate is perhaps somewhere over there, but all he gets is the odor, and from the odor, which is very coarse information, he has to guess, to learn, where the female actually is. Very partial information. If he just knew where the female was, it would be a problem of optimizing trajectories: I would fly this way, perhaps there's wind, then I would fly that way. That would be a Markov decision process. But they get only very partial information, and they have to deal with it. So what do you do in such a situation?

There are new things appearing in this problem, and some things stay the same. There are states and there are actions. The new thing is observations, and there is also a model of these observations: a probability distribution of making an observation Y given that I end up in state S prime after action A. Conditioning on S prime is just a convention in the literature; it could have been S, or S, A and S prime, whatever. This is the model of observations: it tells you that observation Y happens with a certain probability given that the world ended up in state S prime and I took action A. This is still all you need, and the goal is the same.
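In the same notation as before, the new ingredient would be written (again my reconstruction, not the blackboard symbols):

```latex
O(y \mid s', a) = \Pr(Y_{t+1} = y \mid S_{t+1} = s',\, A_t = a) \qquad \text{(the model of observations)}.
```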
But now the question is: what is the policy? How do you define a policy? I erased it precisely because the first choice no longer makes sense. The first choice was a map from states to actions, but if you don't know which state you are in, how can you build such a policy? So you have to change the notion of what a policy is in the context of partially observable Markov decision processes. To address this, I will spend just five minutes on a very simple example of what a Markov decision process is and what a partially observable Markov decision process is, to highlight the differences, and then we will get back to the basics, to the idea of what a policy is for a partially observable Markov decision process.

The example is very simple. It's a system which can be in only two states, S1 and S2, so a very simple state space, just two states, and very simple actions. If I'm in state S1 and I take action A2, I get a reward plus R, a positive number, and I move to S2. Or I can take action A1, which sends me back into state S1 and gives me a negative reward, a punishment of the same amplitude for simplicity. If I take action A1 from state S2, this gives me a positive reward as well and sends me to S1; whereas if I take action A2 from state S2, this punishes me again with minus R. So you see, the same actions in different states give different outcomes, because in one case the action sends you back into the same state, in the other case it sends you into the other state. What's the best policy for this system? It's very simple, because it's not even stochastic: you go from one state to another with probability one if you take a given action. So you can easily guess the best strategy if this is a Markov decision process, that is, if you know which state you are in. If you start from state S1, what would you do? I would take action A2, which sends me to state S2; then I would take action A1, then A2 again, et cetera. The best policy is clearly: if I am in S1, choose A2; if I am in S2, choose A1. It's the same pair of actions, it just depends on which state you are in. Suppose you have a screen in front of you, and the screen may be green or red, and the actions are press I or press J. If the screen is red and I press I, I get one euro; if the screen is green and I press I, I get minus one euro; and this flips red and green. It is really the same action that you perform, but it has a different effect depending on the state you are in. The example is a bit cooked up for the purpose of what I will show now, but the best strategy is obvious from this standpoint: from S1 you take action A2, and from S2 you take action A1. That's the best policy for the Markov decision process.

Now I will introduce a partially observable Markov decision process derived from this one, and it is very simple: my observations are awful. Every time I run this process, I just get the same answer. You are in state three. You are in state three. You are in state three. So I'm not able to tell which state I am in. It's the worst case of partial observation: no observation at all. The question is, can you act as well as possible in such a setup?
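Before turning to the partially observed version, it is worth noting what full observability buys here: under the dynamics just described, alternating the right actions earns plus R at every step, so a quick computation (mine, under the stated transitions) gives

```latex
\pi^{*}(S_1) = A_2, \qquad \pi^{*}(S_2) = A_1,
\qquad
V^{*}(S_1) = V^{*}(S_2) = R + \gamma R + \gamma^2 R + \dots = \frac{R}{1 - \gamma}.
```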
So, in order to answer this question and to see how it guides us to the right way to act in the presence of partial observability, let's look at very simple strategies. The first simple strategy you might think of is deterministic: you just pick one action and stick with it. That is clearly very stupid, because if I always take action A1, starting here I get minus R, minus R, minus R, minus R; and if it's A2, I get plus R at the first step and then minus R, minus R, minus R. So it doesn't really matter what gamma you take: in the long run this is going to be a very bad strategy. Deterministic strategies in which you stick to one particular action, remember, you don't know whether you are in S1 or S2, are very bad. Even, yes. I know, that's what I said at the beginning: I know the model, but I don't know the states. In technical terms, there is a hidden Markov model which I want to control: there is a model which is Markov, I don't see its state, but I want to control it nonetheless. Even the random strategy, in which you pick actions A1 and A2 totally at random, does better than this, because on average you get zero, which is better than the negative sum I was getting a second ago.

But there is something smarter you can do. Suppose you now allow the agent to remember what its past action was. If it remembers its past action, then a good strategy is to do the opposite: was I taking action A1? Next time I will take A2, and alternate between the two actions. In this case, even if you start with the wrong action, you eventually enter the right loop. Suppose I start with action A1. I get the penalty, minus R; that's bad. But now I remember that I took action A1, so I take action A2, which gives me plus R. Then I take action A1, which gives me plus R again, and I keep alternating. The example is clearly contrived, but the basic take-home message is that you can do very good things if you allow the system to remember its past actions. You allow the notion of a history. If the agent is able to collect experience and learn from it, it can do well even in a situation where it doesn't have access to the real state of the system. So the notion of a strategy in this case, of a policy in general, is that you have to map histories into actions. In the past I have chosen action A1, made observation Y1, and got reward R1; at the second step, A2, Y2, R2; and so on up to step T minus one. This is the whole history of what happened in the past: actions, observations, rewards. You keep a very detailed record of what happened, and from that record you decide your new action at time T. That's how you map histories into actions. Now, this accounting for history is very important, because when you had only partial observations you completely destroyed the Markov nature of the process: it's like having a process which lives in a very high-dimensional space while you look only at a projection of it. Of course it's difficult to predict what will happen in the future if all you have is a very low-dimensional projection of a dynamics that unfolds in a high-dimensional space.
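As a sanity check on the toy example, here is a minimal sketch that simulates the three strategies just discussed: stick to one action, act at random, and alternate using memory of the last action. The state and action names are my own labels for the two-state system described above, not the lecturer's notation.

```python
import random

R = 1.0  # reward magnitude in the two-state toy problem

def step(state, action):
    """Dynamics of the toy example: in s1, a2 pays +R and moves to s2,
    a1 pays -R and stays; in s2 the roles of the two actions are swapped."""
    if state == "s1":
        return ("s2", +R) if action == "a2" else ("s1", -R)
    else:
        return ("s1", +R) if action == "a1" else ("s2", -R)

def run(policy, steps=1000):
    state, total, last_action = "s1", 0.0, None
    for _ in range(steps):
        action = policy(last_action)
        state, reward = step(state, action)
        total += reward
        last_action = action
    return total / steps          # average reward per step

# Three strategies that only ever see the useless observation:
fixed  = lambda last: "a1"                                   # stick to one action
rand   = lambda last: random.choice(["a1", "a2"])            # pick at random
memory = lambda last: "a2" if last in (None, "a1") else "a1" # alternate with the past action

for name, pol in [("fixed", fixed), ("random", rand), ("alternate", memory)]:
    print(f"{name:>9}: average reward per step ~ {run(pol):+.2f}")
```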
So partial observability breaks Markovianity, but the introduction of a history restores it, because the process over histories is always Markovian: you keep track of all the past and accumulate knowledge as you go. You can imagine that keeping this long record of events is particularly cumbersome, so it's important to know that there is a way of compressing all this information about the past history into a more compact object, which is called the belief. What is it? The belief is a very simple object: it is a probability distribution over states. These b(S) are positive quantities which sum to one over all the states. You assume that you start with some initial belief, your belief that at the initial time you are in state S; you don't know the state, but you can form a belief about it, attaching to every state of the world some probability that you actually are in that state. Then you update these beliefs with your observations: you use your experience to go from one time step to the next, and you do that by Bayes' rule. That is, the new belief in a state S prime, after taking action A and making observation Y, is proportional to the probability of having made that observation given the new state, times the old belief propagated through the model: schematically, b'(S') is proportional to O(Y | S', A) times the sum over S of P(S' | S, A) b(S), and then you normalize by summing the same quantity over S prime. This gives you the rule by which you update your beliefs as you gain new experience: you take an action, you make an observation, and you update your beliefs. It's just Bayes' rule for conditional probabilities. This description in terms of beliefs, this probability distribution that you carry along as you make your decisions, is actually equivalent to keeping track of the whole history; technically, the belief is a sufficient statistic for the history.

Why is that useful? Because at this point the partially observable Markov decision process can be remapped onto a Markov decision process which now takes place in the space of beliefs. What does that mean? In the Markov decision process we had a set of states, which I said were discrete, and a policy which mapped states into actions. In the partially observable Markov decision process there is no longer that space of states; there is a space of beliefs, which is a continuous space by definition, a simplex. I'll draw it like this, but it lives in a multidimensional space, and your policy is something that maps beliefs into actions. If you accept this mapping, you can write down a Bellman equation also for the value function in the space of beliefs. There is a very high price to pay, because you have moved from a discrete space to a continuous space of probabilities, so the solution of this problem is actually even harder than the previous one. Nonetheless, the key message is that if you keep account of the history, if you keep account of it in terms of beliefs, then you can approach the problem formally in the same way as before: you can define optimal strategies and try to compute them by certain numerical methods. To conclude this part, I will just mention one specific class of partially observable Markov decision processes which has been studied a lot in the literature and for which there are very special results. These problems are called multi-armed bandits, the name coming from slot machines, the one-armed bandits.
Suppose you have a set of N slot machines, and you can play one machine at a time. They give randomly distributed rewards, and you want to play so as to get the most out of them. One of these machines has the largest average payoff, and you want to discover which one it is and eventually, in the long run, play only that machine. This is a partially observable Markov decision process of a very special kind, because your actions don't change the state of the world: the machines will not react to your actions by changing their probabilities, which is, by the way, what happens with real slot machines; if you happen to win too much, they will downgrade you. But with these naive slot machines you just have to discover what their payoffs are. You have to discover this by playing, and this is a very important example because it shows with great clarity the big problem of learning under partial observability: you have to combine two opposing requirements. On the one hand, you want to exploit the information that you have about the system: if your experience is that one slot machine typically gives higher rewards, you want to exploit this information and play that machine more frequently. On the other hand, at the same time, you want to explore, because if you stick with the apparently best machine from the very beginning, you might, just out of bad luck, be losing the opportunity to sample other actions that at the initial time seem worse given the current information. This trade-off between exploration and exploitation is one of the key ideas that emerged from the study of these partially observable decision-making problems, and it is crucial: in all decision-making problems, if you strike the right balance between exploration and exploitation, you can be optimal; if you miss it, you will fail miserably (one simple recipe is sketched just below). And this insight comes from the study of this kind of simplified partially observable Markov decision process. I could expand a lot on this, so if you're interested, I can say more at the end of the lecture. But now I need to cut it short, because I want to discuss the other problem, which emerges when you don't know how the environment behaves. Can you act optimally in that case?

The framework is essentially the same again. As I said before, in order to simplify things, let's restore the assumption that the environment gives us states, not just observations, but now we don't know the model. The policy goes back to its initial form: I have to determine it as a function of states. The observations are perfect, but I don't know what the model is. This goes under the name of model-free decision making. So the question is: can we optimize this target again over possible policies, without knowing how the world will react to our actions, without knowing the laws of nature? The answer is yes, and it is provided by a series of very important techniques which come directly from mathematics. I will outline them very quickly. The basic idea behind these methods is that you want to compute these quantities without knowing the P. And if you don't know the P, there is still one thing you can do: you can sample. Suppose that the world just gives you outcomes: you choose actions and the world gives you new states, but doesn't tell you how it generates them.
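Going back for a moment to the bandit example above, here is a minimal sketch of one possible recipe for balancing exploration and exploitation, the so-called epsilon-greedy rule; the payoff probabilities and the value of epsilon are invented for illustration, and epsilon-greedy is just one recipe among many, not the optimal bandit strategy.

```python
import random

# Naive slot machines: each arm pays 1 with a fixed, unknown probability.
true_means = [0.30, 0.55, 0.45]          # hidden from the player (illustrative numbers)

def pull(arm):
    return 1.0 if random.random() < true_means[arm] else 0.0

n_arms  = len(true_means)
counts  = [0] * n_arms                   # how often each arm was played
means   = [0.0] * n_arms                 # running average payoff per arm
epsilon = 0.1                            # fraction of plays spent exploring

for _ in range(10_000):
    if random.random() < epsilon:        # explore: try an arm at random
        arm = random.randrange(n_arms)
    else:                                # exploit: play the best arm so far
        arm = max(range(n_arms), key=lambda a: means[a])
    reward = pull(arm)
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]   # incremental average

print("estimated payoffs:", [round(m, 2) for m in means])
print("plays per arm:    ", counts)
```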
So, back to the sampling idea: can you, on the basis of just this information, the fact that I choose an action and I receive a reward and a new state, without knowing what the rule is, still do the optimization? Can I do that? If you look at it, the problem is the same: I want to maximize this quantity. This problem of maximizing the expectation of something random was addressed in mathematics, again in the 50s, by a large body of work that goes under the name of stochastic approximation methods. They are very simple in their idea, and to discuss them I just need to recall the very simple starting point, which is Bellman's equation. Do you remember? In a nutshell, the main result of the stochastic approximation methods is this: devise an algorithm which starts from a guess of your value function and updates it according to this rule, and require that these alpha t, which are called learning rates, are such that their sum diverges while the sum of their squares converges; use this scheme to approximate your value function, and at each step choose a policy according to your current value function while allowing for some exploration. Then this algorithm converges to the optimal solution with probability one. This method is called temporal difference learning. This s t is the state that was just visited. Thank you. So you update the value of the state you just visited; you don't update the others, yes. There's a delta that I'm not writing here, if you want to write it in vector form. Algorithmically, at every step you just keep track of the state you visited and you update that one. You can do something better and update many states at a time; that's just an upgraded version of temporal difference learning. The good thing is that in order to do this you don't need a model of the environment: you just need to receive rewards and states. The fact that the system itself produces the states according to the probability distribution P takes care of everything, so that this converges to the optimal solution. And there are different versions of this, including versions that work with state-action pairs rather than states alone. Actions and states, yes. There are many of these methods, which go under the names of Q-learning, SARSA, actor-critic methods; you can have many of them depending on the architecture you use, but again, the key point is that this kind of algorithm, without knowledge of the model, model-free, can get you to the exact solution. The policy here is implicit in the fact that you generate the new state S at time t plus one according to a policy which is the best policy given your current estimate of V, plus some exploration. You can do this in different ways, as I was alluding to: from your value function you can construct different policies with different recipes, and each of these recipes goes under the name of actor-critic or Q-learning, et cetera. All of them have to share one single property: they have to explore enough. And this requires parameter tuning. The downside is that there are parameters you will have to tune to make the convergence to the solution as fast as possible. The theory guarantees that you will converge, but as for how to converge fast to the optimal solution, there is no real general result for this kind of setup.
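To make the model-free idea concrete, here is a minimal tabular Q-learning sketch in the spirit of the temporal-difference methods just mentioned. The small environment, the epsilon-greedy exploration, and the 1/N learning-rate schedule (whose sum diverges while the sum of squares converges) are my illustrative choices, not the lecturer's example; the agent only ever sees sampled transitions, never the transition table itself.

```python
import random

# A small environment the agent can only sample from: it returns the next state
# and a reward, but never reveals its transition probabilities.
# (The dynamics below are invented for illustration.)
def env_step(s, a):
    table = {  # (state, action) -> list of (prob, next_state, reward)
        (0, 0): [(0.9, 0, 0.0), (0.1, 1, 0.0)],
        (0, 1): [(1.0, 1, 0.1)],
        (1, 0): [(0.8, 1, 0.0), (0.2, 2, 1.0)],
        (1, 1): [(1.0, 2, 0.5)],
        (2, 0): [(1.0, 0, 0.0)],
        (2, 1): [(0.5, 0, 0.0), (0.5, 1, 0.0)],
    }
    r, acc = random.random(), 0.0
    for p, s_next, reward in table[(s, a)]:
        acc += p
        if r <= acc:
            return s_next, reward
    return table[(s, a)][-1][1:]          # numerical safety net

n_states, n_actions, gamma, epsilon = 3, 2, 0.95, 0.1
Q = [[0.0] * n_actions for _ in range(n_states)]
visits = [[0] * n_actions for _ in range(n_states)]

s = 0
for _ in range(200_000):
    # epsilon-greedy: mostly exploit the current estimate, sometimes explore
    if random.random() < epsilon:
        a = random.randrange(n_actions)
    else:
        a = max(range(n_actions), key=lambda x: Q[s][x])
    s_next, reward = env_step(s, a)
    visits[s][a] += 1
    alpha = 1.0 / visits[s][a]            # learning rate: sum diverges, sum of squares converges
    td_target = reward + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])   # temporal-difference update
    s = s_next

print("greedy action per state:",
      [max(range(n_actions), key=lambda a: Q[s][a]) for s in range(n_states)])
```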
Time to wrap up and attempt a synthesis. The holy grail of decision making is to come up with a theory which accounts for all these things together: lack of knowledge about the rules by which nature works, and the partial observability I was discussing before. There are many empirical attempts at building such algorithms, but I think it's fair to say that there is no comprehensive treatment of the full-fledged problem of decision making. That's all for today, and I'm happy to take questions if there are any. Could you please speak up a little bit? Yes. In the belief case, even in... I hope this answers the question, but if not, just correct me. Even if you have a discrete number of states, the problem of solving the Bellman equation in the space of beliefs is extremely hard; it's in the class of PSPACE-complete problems. If I understand the question well, you would want to have continuous beliefs over a continuous state space. I cannot think of a worse nightmare than that, but people do try to address this problem by approximating: you say, my space is perhaps represented by a set of basis functions, I will project my problem onto that and find approximate solutions; but of course it's an empirical approach to the problem. I don't know of any general result in such a... Yes. Yes, it means that if you start with a prior, which is an initial distribution, and you update your beliefs according to Bayes' rule, you end up with a final posterior which contains exactly the same information as the string of all your observations. There's nothing more you can add and nothing you can leave out; that's the notion of a sufficient statistic. Any observable computed starting from the belief will have exactly the same information content as one computed from the string of observations.

Yes, so in the bee's case, in the insect search example, the biggest source of uncertainty is in the partial observability, because in that case the model of the actions is simple: the bee has to be able to foresee that if she decides to fly in that direction, after one second she will be one meter closer. So it's in how good she is at computing her position in space, and we know that many animals are extremely good at that. If I had to point to the biggest source of uncertainty, it is the partial observability, the fact that they have to rely on a very, very unreliable signal, such as odor in the field, in that case. But there are different cases. For instance, dragonflies are able to catch an insect which is flying while they themselves are flying, and they do so by intercepting it. In that case the biggest difficulty is to build the right model of the environment, which is the other insect, but they are able to do so. It's different because they observe very well, they have very good sight, so they essentially know the relevant coordinates of the system; the big thing is: given that I see the insect seems to fly at that speed, where will it be in one second, where will I intercept it? So in that case everything lies in having the right model, learned from evolutionary experience. Yes, actually one learns a lot from the bees and the monkeys and our own brain. For instance, just to focus on this example of model-free learning, there is one discovery in neuroscience which is the nicest according to my taste: in the brain there are neurons which actually compute this quantity.
They fire in a way which is absolutely compatible with the computation of this quantity. So there are parts of our brain which actually implement this algorithm: they do temporal difference learning and react accordingly. They are part of the dopamine reward circuit and... Yes, it was a two-way process: people started thinking about models of this kind because all these ideas come from operant conditioning and classical conditioning. Initially this was Pavlov, and people tried to formalize Pavlov's observations of how dogs learn to respond to stimuli, even in the presence of delays; then it went into neuroscience, then back to computer science, and it kept going back and forth. That's just one example. Another example, which I didn't discuss, is deep learning, which is very popular now. Deep learning is about the problem of constructing a representation of the world, which is just like understanding what a good set of basis states is for my space of states. These ideas, which are now very popular in computer science, came from neuroscience, from vision: people were looking at the visual system, at how the areas in the visual cortex are organized, then tried to export this into computer science, and then it fed back into neuroscience again. So I think it's a very permeable interface between biology and computer science.

Well, in this case you actually don't need to know what the shape of the reward function is. You just receive it; it's going to be whatever it is. If your concern is about the conditions one has to impose for this algorithm to converge, then yes, of course, there must be some conditions on the probability distribution of this random reward for the algorithm to converge. But in general we're not assuming anything here; this is really model-free. The only thing you need to know is that there is going to be a set of states. Of course, this is going to be limiting: again, you cannot learn to play chess with this, because that table would anyway have to be of size 10 to the 47. So there are other limitations. Again, I tried to split the two problems into different parts just for simplicity, but the full problem mixes both of them, absolutely. The way you schedule these things, the way you fix your parameters, the way you transform values into actions: all of these contain parameters that in this approach you will have to tune. There's no way out. Also the discount factor, of course, affects the speed at which you learn. Just think about the problem of search. The problem of search is a typical problem in which the reward comes late: you have to fly, go around, reach the female, and then, perhaps, you get the reward. That's life. But it's a very distant target in time. In that case you cannot expect to learn anything with this approach unless you get there, the process ends, and then a new trial starts, et cetera. That's why we believe this approach doesn't work for the search process: they have to do it in one shot, and this doesn't work in one shot; you have to make many, many trials before learning. But if you have a model, you can get there in one shot.