Okay, shall we start? So today we are going to explore, as much as we can in one hour and a half, what we can do when we are up there: when we have perfect observations about the environment and we know exactly what the laws that govern the system are. These laws might well be stochastic, right? So it need not be deterministic, but, in the Markov chain language, we know the transition probabilities. But before getting into this, which will require a little bit of calculation and will in the end lead us to a main result, basically summarizing in one hour and a half the work of many, many people over more than a decade around the 1950s, let's go back to examples first, in order to get an idea, in a very simple case, of where we are in this diagram and what the things we're talking about are. So I redraw the two main concepts that I want to use to orient ourselves in the problem of reinforcement learning in general. Like I said yesterday, there is an agent. The agent continuously interacts with the environment and gets as percepts a couple of inputs. One of these is the reward, the immediate reward for some action that has been taken while being in some environmental state s_t. The environmental state need not be accessible in general to the agent, but the agent might get some observation out of it, which hopefully conveys some information about the environment itself. And on the basis of this it builds some policy, which is a way of mapping the history of all experiences gathered up to that time into some action. Okay, and the goal is to find the policy that maximizes some discounted form of the reward. I will write this again for you. So the goal is to maximize, over policies, the expected value of a sum. Let's say we start conventionally at time t equals 0. Then this will be the sum over all future times of gamma^t times the reward at the subsequent time, r_{t+1}. Gamma is the discount factor, which expresses how far into the future you want to look. So this is based at time t equals 0. If the environment is changing, of course, this might carry over to another formula which I wrote yesterday, but that's just intended as a reminder here. Okay, so then we were discussing this example, which is a very simple example in which there's just one state. Now, since there is just one state, there's little to be observed here, right? So we are actually up at the top here, but there's still this whole axis which has to be explored. And we will see what these things mean in any case. But first, let's identify simply in this diagram the concepts that are written here. So the agent is the agent here, which decides what to do. This is the state of the environment, which is always the same. The environment evolves like this: given the state, at the next time it is the same state, and the next time again the same state, with probability 1. Nothing happens. It seems to be a very dumb environment. Then what the agent has is the opportunity of choosing between two actions, and it can select one of these two actions with a given probability, which would be pi_0 or pi_1 in this case. This is a probability distribution and of course, since the agent has to do something, they must sum up to 1. So there's just one independent number here, running from 0 to 1. Say it chooses action 0. If it chooses that, then two things may happen.
So there's some random process, decided by the environment, on which the agent has no control, which will give either an outcome of plus 1 as immediate reward, with probability p_0, or an outcome of minus 1, with probability 1 minus p_0. Otherwise, if the agent selects action 1, the same thing will happen with a different probability p_1. This is just one simple example of a decision process. So you might think, and we're going to take a poll about this: is this a mathematical oddity, or does it have to do with some particular decision process which is relevant to us humans? So who thinks that this stuff is just a simple mathematical example? I mean, both statements are true. But who thinks that this is prevalently a mathematical toy? So no, this thing is very interesting, because I have an example in mind. So who thinks that this is a mathematical toy? Who thinks that this is an example of decision making for real? One among you: what kind of thing do you have in mind? That's very abstract. Something with the bus, fine, good enough, that's a nice example. Okay. The only extra complication in that example is that you have to gauge the value of the time that you're waiting, which is not immediately mapped into this. Just one more example: a lottery. Very good. So we will go through that example, because that's the example in which this kind of problem was actually born. For this example, I will need a volunteer. Please come around the desk. So I took two coins here from my pocket and I'm putting one here and one here. Right, you are the agent. You have to flip one of these coins. You decide which, coin zero or coin one, and you flip the coin. And then if it's head, you will get one euro. Yeah, sure. Just not for me. And if it's tail, you will have to give me one euro. That's for you. Speak up. Question? No, let's postpone the question, just for a second. So do you recognize that this is the same thing? Do you agree? Okay? He says he completely knows the state. Well, the state, of course he knows. There's no state to be known here. Ah, you mean he knows the model. He knows what the probabilities of the flips are. Do you know this? What are the probabilities? Half. Why? Why? This is even more surprising than this. Are the coins fair? You cannot tell. I just put two coins out of my pocket. You're assuming a lot of things about me. You don't know me. You will get to know me soon, okay? Nobody knows. You can make an assumption. You can say, okay, I think that the coins are fair. Okay, in that case, what would be your decision? Yeah, exactly. Yeah, which one do you want to flip? This one. Why? Pardon? I have no preference. You have no preference. Yeah, right. Because if your expectation is that these two probabilities are the same, there's no point in choosing one or the other. But that's an assumption about how the world goes. Another thing is, if I told you: this coin is biased, because all my coins are biased, and this one is 60% in favor of being, say, tail, and this one is 60% in favor of being head. Which one would you flip? This one. Given that information, you would choose that one. Okay? Does it make sense? Yeah. Now I remind you that not later than yesterday, at some point, some people proposed that you have to choose proportionally to the probabilities. So does it make sense? Should I flip this one 60% of the time and the other one 40%? Okay? So that answers the question.
And we will see how that comes about. So, is it a general lesson that, given full knowledge of the environment, so given these p_0 and p_1, the best strategy is always deterministic? Yes, I hear a resounding yes. So who thinks yes, raise your hand. Given that you know the probabilities. Good. Who thinks the opposite? Okay, very good. Enough to avoid spending half an hour on proving it. So, okay. For these kinds of problems, unless there is some degeneracy, that is, you get the same thing either way, in which case it's indifferent, if the degeneracy is lifted and there actually is a way of getting the best result out of this, then it will be obtained by a deterministic policy. Or at least it can always be obtained by a deterministic policy. Okay? That's true when you know everything. So when I tell you, this is 60%, this is 40%, we are staying here. Thank you very much, I will call you back again soon. We are here, right? We said we are at the top here because there's nothing to be observed, and we are on the far right because we know the laws that govern our world. And the laws that govern our world here are the nature of the coins: how much they are biased. Is it clear? Okay, then let's remove this gracious information that I gave, and we go back one step and say: at the beginning, you don't know. You don't know. But are you entirely ignorant about this? No, you do know something. You know that this is nonetheless a Bernoulli process, because the outcomes can only be head or tail with a certain probability. So you might not know the values of these probabilities, but you know the structure of the process. You know that if I flip it, it will never come up dragon. Okay? It can be head or tail. So you know something about the structure of the model, and you have to guess what the parameters of the model are. And if you get information about the parameters, and this you do by experience, you can improve your way of making decisions. This is what happens when you are in the middle. You know something about the environment: these are coins, and they are not expected to land, say, standing on their edge. I will never be able to do this, of course. You don't expect this to happen, right? It might, in principle, but you are attributing zero probability to this event. So in this case the process is Bernoulli, and it's just an assumption, and then you have to work out what the probabilities are, right? So you are here in the middle. And then, if you move further towards the extreme, there is the extreme situation: the one in which, for instance, I give you the coins, you don't know anything about them, you flip a coin, you get some result. Then I put the coins back in my pocket and pull out a pair of coins. Might be the same pair, maybe not. This is the case of a rigged lottery. Rigged, you know what rigged means? It's cheating. Someone cheats, okay? If you win, I might pull out a different kind of coins, adjusting to your decisions to work against you. Technically this is called an adversarial lottery. In this case, you know very little about the environment, because the environment is changing as you act, and against you. Can you still win at this game? Are there algorithms that do that? You can try. Yeah, exactly. That's one way. But you see the complication with that, right?
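(Not part of the lecture, but a minimal sketch to make the point about deterministic policies concrete. Assuming the head probabilities of the two coins are known, with the made-up values p = [0.4, 0.6] below, it compares, by simulation, always flipping the better coin against the "proportional" policy someone suggested, which flips each coin in proportion to its head probability.)

```python
import random

p = [0.4, 0.6]  # head probabilities of coin 0 and coin 1 (illustrative values)

def flip(coin):
    """One flip: +1 euro on head, -1 euro on tail."""
    return 1 if random.random() < p[coin] else -1

def average_reward(choose_coin, n=100_000):
    """Monte Carlo estimate of the expected reward per flip under a policy."""
    return sum(flip(choose_coin()) for _ in range(n)) / n

deterministic = lambda: 1 if p[1] >= p[0] else 0                            # always the better coin
proportional = lambda: 1 if random.random() < p[1] / (p[0] + p[1]) else 0   # flip in proportion

print("deterministic:", average_reward(deterministic))   # close to 2*0.6 - 1 = 0.2
print("proportional :", average_reward(proportional))    # strictly smaller, about 0.04
```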
So, moving from one side to the other: I hope this is clear for you. If I give you the probabilities, it's just a matter of computing, right? You use paper, or intuition in this very simple case, and then you make a decision, and the decision is correct. But there's no learning in that. What did you learn? You could have guessed this without flipping any coin. I told you, this is 60%, this is 40%. Nobody flips any coin; you already know what to do. There's no learning in that. When you only know that it's Bernoulli, there has to be some learning, because you don't know in advance what the probabilities are. You might say, okay, I think he looks like an honest person, so perhaps it's 50-50. Bad guess. Bad guess, but for you. Experience will show you that I'm not that honest and the coins are biased in some way. And then you go to the other case where not only do you have to learn the probability with which these events come, but also the process which generates this ever-changing probability, which is an even more complicated task and requires a lot of learning. Okay? You know something? Exactly. You still know that it's Bernoulli, but the probability would be changing every time, depending on the past history as well. So again, you know something. There's an even worse case in which, again, anything could happen. The coin could disappear while in flight. It could land on an edge. It could show dragons. If this lottery is dematerialized and it's just a machine that you find in a bar where you press a button, well, you put a lot of confidence in the assumption that it will always show you the same kind of results, right? But you don't know. You don't know. Okay? So this is far out on that side. Okay. So these kinds of decision problems are actually extremely famous. And besides this very simple example that I gave you, the lottery one, there are actually very, very serious and interesting applications. I will just mention a couple of them. One of them you should be familiar with: advertisement on the internet. These days, if you click on a Wikipedia webpage, you will get a banner which says, please donate. But it doesn't always show the same text. There are several of these ads with different texts. So this is exactly the same process. Suppose that there are just two ads. Now the agent is Mr. Wales. Is that his name? I don't remember. Well, Mr. Wikipedia. He is the agent. And the agent chooses whether to send to you, who are the environment now (you are clicking on the Wikipedia page, so you are the environment), banner zero or banner one. And then with a certain probability you will donate, usually very, very little; otherwise it will return zero reward for the agent, for Wikipedia, right? Then there are algorithms that take this kind of process very seriously and decide online which banners to propose in order to get the most out of you. The one which is more effective is discovered by trial and error. You don't know in advance how effective a banner will be, okay? So it's like the case where you don't know what the probabilities of the coins are. But you will repeat this game, and by repeating it you learn how to do it. And of course it's small amounts multiplied by millions if not billions of clicks, which makes for a substantial amount of money.
You would have to click many times on the Wikipedia page, and keep track, to notice that, okay? There are results about this, about past campaigns, and I have figures for that if you're interested. Typically the amount of money that you get for every click is minuscule; it's, I don't know, a hundredth of a dollar. It's a softer case because there's not an immediate reward in that case; it's a little bit more complicated to map. So it's unclear how much YouTube, say, benefits directly from that. But yes, you could assimilate it more generally to this; the banner is the more direct example. The second example, which is even more serious, is: how do you run clinical trials? Clinical trials. So you have some treatment for some illness and you have to check whether it's working or not. And you have to do this by comparing the treatment with a placebo. The placebo is something that looks like the treatment but doesn't do anything, right? So in that case you have two options, placebo or treatment, and the outcome will be how effective it is on the patient. And you have repeated trials to do, and patients come in over time, okay? Sometimes you have a large cohort available, but for some rare diseases you just have one patient coming in, and then another one in three weeks, and then another one in five months. And as they come, you have to learn how effective the treatment is. Because you don't want to wait 10 years before discovering that the treatment was actually extremely good, or maybe had a negative outcome. So you learn as the data come in, and this is again the same process. Now, all these kinds of processes go under one name, and the name comes again from gambling jargon: these are called multi-armed bandits. What does that mean? So this is the name of this kind of process in which there is one state and many actions to be taken. There is an obvious generalization of this with actions 0, 1, 2, up to capital A minus 1. And then the rewards might be Bernoulli or whatever, Gaussian, okay, it doesn't matter. All this class of processes goes under the name of multi-armed bandits. They are part of a bigger class, which is called sequential allocation problems, which is even wider and is an enormous field of research in operations research: decision making in factories, in logistics; it's a huge field of mathematics. So what does the name mean? What does it have to do with this simple model? Who knows what a one-armed bandit is? Have you ever heard of a one-armed bandit? No? Sorry? A slot machine. Why is that? Yeah, slot machines have one arm and you have to pull the single arm, right? These are old slot machines, the mechanical ones. You pull the arm and then, if you were lucky, coins would pour into your basket. Why a bandit? It gets away with your money, yeah, right? That's the reason. Multi-armed bandits are just bandits with several arms. So this is a two-armed bandit. You can pull arm zero or arm one, and then with a certain probability you will get a coin, or you lose the euro that you put in. Whatever, okay? You can clearly see the analogy, fine? So these problems, in their simplicity, are extremely interesting. They are still the subject of open research, with many counter-intuitive results. This will be our workhorse number one. We will come back to this problem repeatedly. Today we will treat it in this corner; then we will discuss what happens here, and then we will discuss over there.
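(A side note, not from the lecture: this is roughly how the trial-and-error discovery just described could look in code, for a Bernoulli multi-armed bandit such as the banner or placebo-versus-treatment setting. The epsilon-greedy rule, the arm probabilities and the number of steps are all made-up choices for illustration.)

```python
import random

def epsilon_greedy_bandit(p_true, steps=10_000, eps=0.1):
    """Learn by trial and error which arm pays best (Bernoulli rewards in {0, 1})."""
    n_arms = len(p_true)
    counts = [0] * n_arms            # how many times each arm was pulled
    means = [0.0] * n_arms           # running estimate of each arm's payoff
    total = 0
    for _ in range(steps):
        if random.random() < eps:                      # explore: try a random arm
            a = random.randrange(n_arms)
        else:                                          # exploit: current best guess
            a = max(range(n_arms), key=lambda i: means[i])
        r = 1 if random.random() < p_true[a] else 0    # the environment's hidden coin
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]         # incremental average
        total += r
    return means, total

# e.g. two banners with unknown donation probabilities (made-up numbers)
estimates, donations = epsilon_greedy_bandit([0.02, 0.05])
print(estimates, donations)
```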
It will accompany us because it's very intuitive and simple, but it's very rich and not trivial at all in practice, okay? So before getting back to this problem, which we'll discuss in greater detail, I would like to give you a few other examples of decision-making processes, in which the exercise is to understand what the things here are. What is the environment? What is the agent? What are the actions to be taken? I don't have much time to do all the examples, so I will just focus on a couple of them. One is actually taken from physics. So that was example number one; example number two is called the cart pole. I'm discussing this example briefly and qualitatively because this will be the one on which you will be able to train your deep Q-learning algorithm next Tuesday and Wednesday, in order to understand what the best policy is. You will be running basically what AlphaGo does, but on a super trivial mechanical task like the one I'm describing, okay? This is just to avoid extra complications on top of what you will discover. So what's the cart pole? The cart pole is a very simple task. There is a cart, and on top of this cart, attached to a hinge, there is a pole, a stick. The task is to keep the pole within a certain angle. Why is that difficult? It's only hinged here, right? There's just a hinge at the bottom. So why is it difficult? It's unstable, right? It doesn't stay like this; it will fall down one way or the other. So how can you keep it in place? Suppose it's falling this way. What can you do to keep it in place? You can move the cart and recover, right? So it's a balancing act; it's what you would do intuitively. This one is pretty stable because the water is at the bottom, but if it goes this way, I will try to compensate, right? That kind of thing. So it's possible to cast this cart-pole problem in this framework. So what's the state of the system for this thing? What is the state of the system? Good. This is one degree of freedom. So the state is described by, let's say, the angle theta that the pole makes with respect to the vertical, and the position and velocity of the cart. Position is enough; someone also proposes the angular speed, right? Okay. So I would settle for this within the Newtonian framework: it's just a system with two degrees of freedom, the angle and the position of the cart, plus the associated momenta, which are the angular momentum and the linear momentum. Of course, it's not this one. Of course it isn't. It's a situation where there are laws of motion, so we are not here at the top; I mean, I'm still including it. Okay. That's a good question. That's a good question because it poses the question: what are the actions that you can do? The point being made is right. If I could act directly on the pole, if I could just decide what to do with the angle, then I would only have to control the angle. That is what you have in mind, right? I didn't specify that you can't do that. But what you can do is act only on these cart variables. I would say that, in order to describe what will happen in the future, you need the angle and the angular momentum, and the position and the horizontal velocity. Why not position? Okay, let's settle for something we all agree on. So is it okay for you to describe this mechanical system with these degrees of freedom? So, there might be some situations. I will get to this comment in a second.
There might be situations where you have a more compact description of the system which might apply, okay. And I will get directly to your question about position in a second. Then, the actions that you can take, like I said: the action is a force, a linear force which acts only on the cart, okay? So the actions are a force on x. So your dynamics will be x dot equals v (say the mass is equal to 1), and v dot equals f; and then there is the reaction of the pole on the cart, and that will be the equation for the cart, okay? You will see all this in detail in the example, right? So this is the kind of action that you can take. Now, what's left to do? We have the state and the actions; you still have to define the percepts. So, rewards. Rewards can be, for instance: you get a reward every time you are within that angle, and when you get outside of that angle, you get some penalty, for instance. In addition to this, you might want to put boundaries on the position, so that every time you hit a boundary you stop the system, for instance, right? And then you get a big penalty for that. In this case, position obviously enters. You see, there are many ways of shaping the reward function for this, okay? So what are the observations in this case? You look at your system. You might be able to measure position, angular velocity, linear velocity; all these things you might be able to measure with arbitrary precision for all practical purposes. Or it might be a very coarse description. Or you might say, okay, I want to control this system, but I only look at the angle of the pole; I don't care where the cart is. This might be good at some point, but then you will hit the boundary, for instance. So it's very incomplete information for that part of the problem which makes us say: I have to keep the cart within a certain range of positions. You see? The only exercise I want to do now is just that you feel at ease with this description and say, okay, it includes all of this, I think it makes sense, I can work out specific instances in one case or the other. Of course, if you change the definition slightly, the task will be slightly different. But what you would expect, in practice, for this system to do is just to stay very close to the vertical position. The percept is a couple: the reward that you get, the one I was discussing (a positive reward if you are within that angle, a negative one if you're out, a negative reward if you bump into the boundaries), and the outcome of your measurement of the state, which might be a coarse-grained observable of it, say the state plus Gaussian errors at every time. These might be the observables. Or it might be the state itself: perfect observation. If the observation is perfect, y_t coincides with s_t, and you know everything about the system. Good. Any questions? Third example: it's called grid world. This is a very simple example of a navigation problem. So there is some notion of space, say a square two-dimensional grid. Can you see? Yeah. I'm drawing it quite small. It's a grid like this. Then there are some points on the grid which are sort of cancelled out. There is some point on the grid which is a starting point, and there is some arrival point T, capital T. So the agent is a walker which can jump between sites of this lattice.
It can make a move from here to here, or from here to here, or from here to here. So what is the state of the system in this case? It is the position of the agent on this board, on the chessboard if you wish. So if the agent is here, this is the position of the agent. Capital A is a bad choice because it looks like an action, but this is the position of the agent. So if I say this is s_t, it means that the agent is in position (2, 4) on my board. This is the state of the agent, which is also the state of the environment: it's the position of the agent relative to the environment. Then, what are the actions? The actions might be north, west, south, east. That is, the agent decides that it wants to go up, left, down, or right, whichever you prefer. So there might be four actions in this system. Then the outcome of these actions might be random. The agent might, for instance, say, I want to go north, if there's an allowed space there. Suppose the agent is here, let's put it here, and it wants to go south. It can, but it gets there only with probability, say, 1 minus epsilon, and with probability epsilon over 3 it goes in one of the other directions instead. So there is an error in the actuation of the policy. An action is taken, but sometimes there's an error and this little robot cannot make a step in the direction it wished, and maybe it goes this way or that way or back, or it stays in place, whatever. So this is how the environment changes as a result of the action. And this transition probability from one position to another is our model of the environment. These are the rules, the laws, which are obeyed by the environment under some action. Is that clear? In this case, what are the rewards? For instance, it might be that there is no reward unless you sit on the state T, in which case you get a positive reward. So the rewards depend on the state you are in. Observations: well, the agent might know very well where it is, suppose it has some sort of GPS system that tells it, okay, I'm in position (2, 4); or the agent might not know where it is in absolute space. So this changes the quality of the observation. If it knows its position very well, we are up here; if it doesn't know it, we are down there. So what is the final task? The final task, with the reward I just described, is to get there as fast as possible and then stay there forever. This is the best thing you can do. So eventually, in a problem like this, if you train your algorithm well, it will be able to find the shortest path, the shortest admissible path, to reach the target and stay there. The agent may know where the target is, in which case you're up here, or it may not know, and it might discover it. That's how the algorithm could work, right? It discovers where the target is, and that it gets a reward there, upon first reaching it, and then it gets back there in the second trial; it will already be able to do this. Or, in the other case, in the upper corner, you know everything: you know the target, and you know all the probabilities of jumping, given the actions that you decide. If you're given everything, again, it's a computational problem. Much more difficult than this one, of course, but still the goal is to understand how to compute these things. I hope this covers many, many examples. Actually, it's more difficult to find situations which you cannot cast in this framework.
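(A small sketch, not from the lecture, of the grid-world transition rule just described: with probability 1 minus epsilon the intended move is executed, otherwise one of the other three directions is taken, and bumping into the border leaves you in place. The grid size, the target position and epsilon are illustrative values.)

```python
import random

ACTIONS = {"north": (-1, 0), "south": (1, 0), "west": (0, -1), "east": (0, 1)}

def grid_step(state, action, size=5, target=(4, 4), eps=0.1):
    """One move on a size x size grid with actuation errors."""
    if random.random() < eps:                        # error: one of the other three directions
        action = random.choice([a for a in ACTIONS if a != action])
    dr, dc = ACTIONS[action]
    r, c = state
    # bumping into the border leaves the position unchanged
    new_state = (min(max(r + dr, 0), size - 1), min(max(c + dc, 0), size - 1))
    reward = 1.0 if new_state == target else 0.0     # reward only for sitting on T
    return new_state, reward

state = (0, 0)
for _ in range(20):
    state, reward = grid_step(state, "south")        # a blind policy: always try to go south
print(state, reward)
```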
The environment is the grid, and the state is the position of the robot with respect to the grid. It doesn't know where it is; then you're down here below. The environment is still the same; it's just that the observations you get, the percepts, are very, very poor, but you will have to decide based on the percepts, not on the state, because you don't know it. So that's why the problem is more difficult. It's just like navigating, but blind. Good. So let's get back to this simple problem, and we just have to do something very simple. We had this intuition with the coins that the best action to take is actually the one with the largest p: if p_1 is larger than p_0, then take action 1 always, and vice versa. This example is so simple that you can actually work out the calculation explicitly, so that's what we will do. It just takes a couple of lines. So we are going to evaluate this quantity for this simple model. Let's start with the first step, t equals 0. First step: gamma to the power 0, no matter what my discount factor is, so the prefactor is 1. And then I have to take the expectation of my reward. So what do I expect in this case? I said that with probability pi_0 I take action 0. So, the first term in that sum; let me call the whole thing capital G for a second, just for compactness. So, for capital G, first step: I take action 0 with probability pi_0, which is my policy. And this is one thing I want: to know how much this depends on the policy. So this is a function of the policy, and I want to optimize over the policy. So, pi_0; then what can happen? It can happen that with probability p_0 I get plus 1, and with probability 1 minus p_0 I get minus 1. Otherwise, I pick action 1 with probability pi_1, and with probability p_1 I get plus 1, and with probability 1 minus p_1 I get minus 1. And this was the contribution from the first step. And then I'm back here again. So I am locked into this cycle where everything comes back to the same starting point. Next step: well, first of all, now I have to discount. Time has elapsed, so the reward, the gain that I will get from this new round of action, will be discounted, will be smaller by a factor gamma. But apart from that, it will repeat itself; everything will be the same. I start fresh and do another round, so it will be exactly the same. And rewriting this: plus 1 times p_0 and minus 1 times 1 minus p_0 makes 2 p_0 minus 1, so the first term is pi_0 times (2 p_0 minus 1). And then I have the other one, but pi_1 is nothing but 1 minus pi_0, so it is (1 minus pi_0) times (2 p_1 minus 1). And so on and so forth, plus gamma squared times the same thing. Are you okay with this simple calculation? Yes: pi_0 is the probability of picking action 0, pi_1 is the probability of picking action 1, and they have to sum up to 1, which is what I used here. All of this is a single Markov process, which depends on probability distributions for these outcomes to occur and for the actions to be taken. So every time there's a policy, a probability distribution which I decide: I flip a coin, or I draw a random variable, and I decide, am I going to pick action 0 or 1, with probability pi_0 or 1 minus pi_0? And then the environment randomly returns me plus 1 or minus 1 with probability p_0 or 1 minus p_0 (or p_1 and 1 minus p_1), and so on and so forth over the whole process. Okay, so all these terms will be the same. This happens just because you always end up in the same state.
And the only thing that changes is this discounting of the future. So when you sum up all these terms, it's 1 plus gamma plus gamma squared, and so on: a geometric sum. So eventually your final result, let me write it here, is that how much you can gain on average, given a certain choice for the policy, so given a certain probability distribution, is 1 over (1 minus gamma) times [(2 p_1 minus 1) plus 2 pi_0 (p_0 minus p_1)]. This is the same thing as before; I just rearranged the terms here, and you can check that it's exactly this. It clearly depends only on pi_0, because pi_1 is locked to be 1 minus pi_0. And now you can ask the question: what is the best policy? I have to find the maximum of this quantity over all pi_0, which, since these are probabilities, is confined between 0 and 1. So where is the maximum? You take the derivative; yes, but it's a linear function, so you can also get there without too much analysis. Where's the maximum? This is a linear function of pi_0, right? So it can go either up or down. It depends on what? On the sign of p_0 minus p_1. If p_0 is larger than p_1, you have to set pi_0 to 1; that's the best thing you can do. Otherwise, you set pi_0 to 0. Are you okay with this? So the max over pi of G(pi) is: if p_0 is larger than p_1, then I'm going to put a 1 here, and I'm going to get (2 p_0 minus 1) divided by (1 minus gamma); otherwise, (2 p_1 minus 1) divided by (1 minus gamma). In the first case, the best policy is to choose pi_0 equal to 1. In the second case, the best policy is to choose pi_1 equal to 1, okay? We're just computing the thing and showing that our intuition was correct: you always play the coin which you know in advance has the best probability of giving you a positive reward. It's a pretty straightforward calculation and it's totally useless, because as soon as you go away from this simple example, it's a mess. But I'm showing this just because you can do it, and then it's fair to show it. I'm now going to give you an example of a slight change of our problem which already makes it pretty cumbersome to calculate this optimal policy directly. And this variation is actually pretty simple. So, let's go back to the demonstration. These are the two coins here, left and right. This was the previous problem, this one. Now I have another pair of coins here; I had to buy many coffees this morning in order to get all the coins for this. I'm putting them on the other side, okay? So these two have different probabilities in general, right? This pair might have p_0 and p_1, and these two might have two different p's, say p'_0 and p'_1. Then the game is slightly changed, and you keep these two systems decoupled, okay? Suppose that at each step you cannot play both of them: you have to be either here, and flip one of these coins, or there. Right? And the game is the same on each side: you always get plus one or minus one depending on whether it's head or tail. But now you have to pay a cost, you have to pay me 20 cents, if you want to cross over. There's a bridge and I'm here; I want 20 cents for you to move to the other side, okay? So now the complication is that if you are here, you might well know what to do. If you know in advance what the probabilities are over there, you might also know that it is more advantageous for you to play over there.
But you will have to pay a cost to get there. So is it worth it? It depends on many things. One thing you might have noticed from here is that this result does not depend on gamma. The value of the outcome depends on gamma, but the optimal policy does not depend on gamma, does not depend on the horizon. Why is that? Because you always come back to the same place, right? It's just like that Bill Murray movie where he is always in the same place every day, same time. But this is not generic. So here is the extension. Now this system has two states. There is the state A and the state A prime, each of which has its own probabilities and, in general, rewards. And then there is a transition which can be made between one state and the other, and there is a cost for doing that as well. So this is summarized in a simple diagram. There is the state A and the state A prime, or left and right, whatever you want. And then there is the usual structure, like we said: with a certain probability pi_0 I pick action 0 here, which with probability p_0 gives me plus 1, and with probability 1 minus p_0 gives me minus 1. And the same here, okay? I'm not writing everything out again; you know how to fill these in. On the other side these would just be primed quantities: action 0 with probability p'_0 gives plus 1 and with 1 minus p'_0 gives minus 1, and the same for the other action, chosen with pi'_1: with p'_1 you get plus 1, with 1 minus p'_1 you get minus 1, and so on. All right? And then there is this transition that can take place. So there is another action, which is the action of switching. You're here, and you must decide: should I flip one of these, or should I take a third action, that is, I switch, I go to the other side? So there is another action here, the action of switching; let's call it action s, for switching. This action will send you to the other state with probability one. So if I decide to switch, I will not get stuck in the middle; I go through to the other side. It's just an assumption, for simplification. And in this case you get a cost. There is a negative reward, minus c_s, where c_s is positive. That's the fee that has to be paid for crossing the bridge. And the same thing happens the other way around: you can switch back, and the same thing occurs; with probability one you will end up here, paying the same cost. Now, a challenge: can you do this calculation for this simple model? Give it a try. It's a Markov chain, so give it a try. I can do this, okay? But it's painful. And clearly we are still not covering much of what the real world gives us. Remember, we started with the game of Go. In the game of Go there are, I don't know, ten to the power of the number of atoms in the universe states available, right? There are huge numbers that we are not dealing with here. Even with just this switching, think about what is involved. You have to optimize over a space which is the space of all possible policies, which is a simplex in a high-dimensional space, because for every state you have a number of actions you can take. So this policy lives in a simplex embedded in a space whose dimension is the number of states times the number of actions. It's not easy, okay? There are other ways of doing this. That's what I mean, and what I want to get to is to show you how these calculations are actually performed. How can you compute the optimal policies, in systems which are much more complicated than this in general? What are the challenges?
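(If you want to try the challenge numerically rather than by hand, here is one possible route, a sketch under made-up numbers and not the derivation we will actually do: enumerate the deterministic policies, play coin 0, play coin 1, or switch, in each of the two states, and evaluate each one by iterating its self-consistent value equations.)

```python
from itertools import product

def policy_value(actions, p, p_prime, cost, gamma):
    """Value of a deterministic policy for the two-state switching bandit,
    by iterating the self-consistent equations V(s) = r(s, a) + gamma * V(next state)."""
    mean_reward = {                                   # expected immediate reward of each action
        "A":  {"0": 2 * p[0] - 1, "1": 2 * p[1] - 1, "switch": -cost},
        "A'": {"0": 2 * p_prime[0] - 1, "1": 2 * p_prime[1] - 1, "switch": -cost},
    }
    def succ(s, a):                                   # switching moves you, flipping does not
        return ("A'" if s == "A" else "A") if a == "switch" else s
    V = {"A": 0.0, "A'": 0.0}
    for _ in range(2000):                             # fixed-point iteration, converges for gamma < 1
        V = {s: mean_reward[s][actions[s]] + gamma * V[succ(s, actions[s])] for s in V}
    return V

p, p_prime, cost, gamma = (0.55, 0.45), (0.9, 0.1), 0.2, 0.9      # made-up numbers
policies = [{"A": aA, "A'": aB} for aA, aB in product(["0", "1", "switch"], repeat=2)]
best = max(policies, key=lambda pol: policy_value(pol, p, p_prime, cost, gamma)["A"])
print(best, policy_value(best, p, p_prime, cost, gamma))
# with these numbers: switch in A, then keep flipping coin 0 in A'
```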
Some things simplify; some others just don't go away. Some problems don't go away, okay? So in order to do that, we have to move away from the examples a little and abstract a bit. I know you might feel a little uneasy talking about these things without firm ground, but now the firm ground comes. We're going to give definitions of what a Markov decision process, which appeared up above here, is, and how to solve it, at least in principle, okay? And of course you will discover, or recover, that all these examples, when you know exactly what the probabilities in general are and what the states are, are just Markov decision processes, okay? Now, it is possible to have situations in which your policy changes with time. Strictly speaking, it's not needed unless your transition probabilities change as well. You can do that. Again, if you know the full future evolution of your transition probabilities, p as a function of time, you will be able to derive an equation for your pi. But again, it's more complicated, because this is now something that lives in the space of states times actions times time, okay? So it's a huge-dimensional space. But in principle, yes, it's possible to extend all this to non-stationary environments, which I will not discuss, because it's just cumbersome and you don't actually learn much more than what you learn in the stationary case. So, Markov decision processes: what are the ingredients? Yes, because it's a cost; it's a negative reward. I said that c_s is positive, so I can interpret this as a cost. It's a minus reward, a negative reward. So you're penalized; you have to pay something if you do that. It makes no difference which way you write it, okay? It's just an example, one possible way of implementing this. Clearly, if there were no cost in switching, and you knew everything immediately, then you would switch to the better option of the two and stay there forever. But if there is a cost, then there is a non-trivial decision to be made. That's a comment I actually should have made before. If gamma were zero, if you were totally myopic in this setup, and the probabilities are such that both of these p's are larger than one-half, so that on this side you always gain something on average, then you would never switch. Never. Because you have to pay a cost, and you don't want to pay a cost; you can get something positive out of this without moving. And this happens irrespective of what is on the other side: suppose that over there, for instance, you always get plus one, the super-biased case in your favor. If you're greedy, you will never discover it; you will stick to this side. Now, if you allow for some horizon, if your gamma becomes even slightly larger than zero, then there is a comparison to be made. If I switch, I lose the opportunity of gaining something here, because I'm switching and I'm paying a price; is it worth whatever I will get over there? I don't know, I have to go there. So you see, in this case already, with just two states, the horizon becomes important, and your decision-making changes a lot depending on your horizon, whether you're farsighted or short-sighted. We will discuss this later. It's not totally unrelated; let's keep that question for the moment, because it would take us too far, but we will discuss it later. So, I was erasing the board. Sorry, you mean whether gamma enters the optimal policy?
Yes, yes, in general it's unclear a priori. If you look at the problem, it's very difficult to say from the outset whether it's going to be gamma-dependent or not, except in cases like this one, where there's just a single state and you know that you will be repeating your experience all the time, so that gamma really doesn't matter; it just scales the overall gain that you have. But otherwise, with multiple states, typically it's gamma-dependent. There might be instances in which it is not. So, Markov decision processes. MDP for short. What are they? What do you need? These are Markov chains plus decisions. First things first, what do you need? States. Good. States. There is a space of states, which for us will be a discrete set of states. We will always be numbering them, okay, just for simplicity. It's not a restriction in general, but it's useful. Then, it's a decision process, so there will be actions. Remember the diagram: there are circles, and then there are squares, and then there are transitions. Actions; we'll get to transitions, never mind. So there will be a set of actions. So again, there is a set of options. In general, this might depend on the state, right? There are some things that you can do in one state and cannot do in another. For instance, think about grid world: if you're in a corner, not all actions are available; you cannot go up. So in general the action set depends on the state, okay? But an easy way around this is to suppose that all actions are available in all states, but the probability with which a forbidden action changes the environment is zero. So you can compensate for that, right? But that's a technical point. So, states and actions. Then, transition probabilities. What are they? They are the probabilities of getting to a new state s prime, given the state s and the action a. That's why I was waiting for a: because the way you go from s to s prime depends on the action you take. Think again about this two-state, two-armed bandit. That's what it was. If you pick action zero or one, you end up in the same state as before. So if I'm in A and I decide to flip a coin, one of the two, I will stay in place A, because I'm not moving. But if I decide the other action, the switch, I will be changing my state, so my final state s prime will be different from the initial one. So transitions between states depend on the choice of actions. Then rewards; yes, they are all over the place. Rewards, let's go for rewards. This reward is a function of what? Of the state, of the action, and possibly also the arrival state. So there is this triplet, which is always around: state, action, and new arrival state. And to this whole process, that is, starting from a circle, going to a square and moving to another circle, you attribute some reward. This might be random or might be deterministic, it depends. Here, for Markov decision processes, we assume that this is the average reward you get for this combination of state, action, and arrival state. So, do you recognize all these ingredients in the previous examples, all these things that appeared? There's just one more thing to add. Observation? No. No, not observation, because here we're talking about a setting in which observations are perfect. Like I told you, we are up above, over there.
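(A minimal sketch, not from the lecture, of how one might hold these ingredients, states, actions, transition probabilities p(s'|s,a) and average rewards r(s,a,s'), in code, using the single-state two-armed bandit as the instance; the numbers p0 and p1 are made up.)

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State, Action = int, int

@dataclass
class MDP:
    states: List[State]                            # discrete set of states
    actions: List[Action]                          # actions, assumed available in every state
    p: Dict[Tuple[State, Action, State], float]    # transition probabilities p(s' | s, a)
    r: Dict[Tuple[State, Action, State], float]    # average rewards r(s, a, s')

# the single-state two-armed bandit as an MDP (p0, p1 are made-up numbers)
p0, p1 = 0.4, 0.6
bandit = MDP(
    states=[0],
    actions=[0, 1],
    p={(0, 0, 0): 1.0, (0, 1, 0): 1.0},                 # whatever you do, you stay in the same state
    r={(0, 0, 0): 2 * p0 - 1, (0, 1, 0): 2 * p1 - 1},   # average reward of flipping each coin
)
```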
So, the policy. The policy in general, which is the thing I'm looking for, and it's important because it's the thing over which we want to optimize, right? So this is the policy, the probability of choosing an action: on what must it depend? In the full reinforcement learning problem, like we said, it depends on all the previous history. But now I said this is a Markov decision process and we know in which state we are. And this is the key property of Markov chains: once you know where you are, you know what the future will be, at least in probability. So you don't need the full history here to make a decision. If you have access to the full state of the system, you don't need any knowledge of the past, because everything is encoded in your actual position, and the future will depend just on where you stand now. So the policy, for Markov decision processes, is a probability distribution over actions which depends only on the current state. No, the point is that if the system is Markovian and you have access to the state, all the history is already there. In statistical terms, the current state is a sufficient statistic for all the past history. That's the stage we are at; that's why we say Markov decision process. The key point is that there might be an environment which itself is evolving in a Markovian way; that might well be true in general. But the point is that we have access to the state. The key point is the observation: if the observation is the state of the system, then we can throw away all the history and we can decide just on the basis of the current state. This is completely unrealistic. In most situations we do not have access to the full state, because what is the full state of the system? Think about physics: you would have to specify all the degrees of freedom of the system in order to be able to predict. For any macroscopic system, this is not the case. Seriously. But this is very important, because it helps us in shaping our concepts, our ideas, of what we can do in this ideal case where we are perfectly omniscient and then we just have to compute. It's like feeling like gods for 20 minutes: being able to predict. Again, this is perfect knowledge, because we know what the rules, the laws, are; these things are given to us. We knew them just before: we know p_0 and p_1 for the one-armed bandit. Two-armed bandit, sorry. We know that the observations are perfect, so we just need to map states into actions, and not histories into actions. Hope that's clear. Fine. Then, now it comes; I will try to keep it within 15 minutes, so it's going to be rather quick. The goal, like we said, is to optimize. So there is one quantity, which is the return, which is the sum over all times, starting from zero to infinity, of gamma to the power t times the reward that I get if I'm in state s_t, pick action a_t, and go into state s_{t+1}. This is along one trajectory: I'm in state s, pick an action a, which brings me to state s prime, and then another action brings me to another place. So this is a random quantity, and we want to optimize one quantity which is called the value, which is the expectation value, with respect to the policy and the transition probabilities, of the return. That's the same thing as before, right? I'm just writing it down explicitly. This is the general definition, and the goal is to optimize this over pi.
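(A sketch, under my own conventions, of what that expectation means operationally: sample trajectories using the policy and the transition probabilities, accumulate the discounted rewards along each one, and average. The dictionary layout and the truncation horizon are arbitrary choices for illustration.)

```python
import random

def sample_return(s0, pi, p, r, gamma, horizon=200):
    """Discounted return G = sum_t gamma^t r(s_t, a_t, s_{t+1}) along one sampled trajectory.
    pi: {s: {a: prob}},  p: {(s, a): {s2: prob}},  r: {(s, a, s2): average reward}."""
    s, G = s0, 0.0
    for t in range(horizon):                         # truncation of the infinite horizon
        a = random.choices(list(pi[s]), weights=list(pi[s].values()))[0]
        s2 = random.choices(list(p[(s, a)]), weights=list(p[(s, a)].values()))[0]
        G += gamma**t * r[(s, a, s2)]
        s = s2
    return G

def estimate_value(s0, pi, p, r, gamma, episodes=5_000):
    """The value is the expectation of the return over policy and transitions."""
    return sum(sample_return(s0, pi, p, r, gamma) for _ in range(episodes)) / episodes
```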
Now, what we will derive is a classical result in decision-making theory, which goes under the name of the Bellman equation, from Richard Bellman, the founding father of Markov decision processes. If you look in books, the mathematical derivation is a bit convoluted. We will go for something which should be more intuitive for physicists; it takes a somewhat longer detour, but I think every step is clearer than in a purely abstract mathematical derivation. So we'll try and go through all the steps of this derivation. The first step is that we have to write this thing explicitly. And one way to write it explicitly is to know, after t steps, where my agent will be. So what is this ruled by? This is a Markov chain, right? So it is ruled by the probability density over the states, and this obeys the Chapman-Kolmogorov equation, which maps the distribution at one time into the distribution at later times. So what is this Chapman-Kolmogorov equation? Let's call rho_{t+1}(s') the probability that my agent is in state s' after t plus 1 steps. How can I connect this to the probability of being in a state s at the previous time? Because that's what I want to do. Well, if I was in state s at the previous time, then what can happen? I pick an action a according to my policy, with probability pi(a|s). Given that, I will be sent by the environment into a new state s', according to the probability p(s'|s,a), which depends on s and on a. And all this is summed over all possible initial states and over all possible actions: rho_{t+1}(s') = sum over s and a of rho_t(s) pi(a|s) p(s'|s,a). So this is the equation which describes the evolution of the density under the policy pi, which I choose, and under the transition probability p, which is in the hands of the environment; I cannot do anything about it, but I know it. So if I use this quantity rho_t, then I can write my expected return as follows. It is going to be the sum over all future times of gamma to the power t, times what? Well, at time t I will be in state s with probability rho_t(s); I will pick an action a with probability pi(a|s); I will be sent into a new state s' with probability p(s'|s,a); and I will get the average reward r(s,a,s'). And all this has to be summed over states, actions, and arrival states s', over triplets. Do we all agree that this is the expectation value of the return? Again: this thing is a triplet. It depends on where you are at time t, what action you pick, and where you are sent. The probabilities for each of these events are rho_t(s), which is the probability of starting from s; pi(a|s), which picks the action; and p(s'|s,a), which sends you to s'. So these three make the probability of the triplet (s, a, s'): after t steps, the probability of seeing (s, a, s') is given by this object, precisely because it's Markov. If you really feel lost now, you'd better raise your hand immediately. This is the expectation of something which happens in the future, starting from some initial distribution. At the beginning, your state is governed by some rho_0(s), from which you pick one state. Then you consult your policy and ask, what action should I take? And the policy in your program says, you should pick action a_0. And then you consult the environment, which says: given action a_0 and state s_0, I will send you to s_1 with this probability. Fair enough.
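(In code, the two formulas just written might look like the sketch below, under my own dictionary conventions; the horizon is an arbitrary truncation of the infinite sum. It does one Chapman-Kolmogorov step at a time and accumulates the discounted sum over triplets, checked on the single-state bandit where the answer is known.)

```python
def next_rho(rho, pi, p):
    """Chapman-Kolmogorov step: rho_{t+1}(s') = sum_{s,a} rho_t(s) pi(a|s) p(s'|s,a).
    rho: {s: prob},  pi: {(s, a): prob},  p: {(s, a, s2): prob}."""
    states = list(rho)
    actions = {a for (_, a) in pi}
    return {s2: sum(rho[s] * pi[(s, a)] * p.get((s, a, s2), 0.0)
                    for s in states for a in actions)
            for s2 in states}

def value_of_policy(rho0, pi, p, r, gamma, horizon=500):
    """V_pi = sum_t gamma^t sum_{s,a,s'} rho_t(s) pi(a|s) p(s'|s,a) r(s,a,s')."""
    states, actions = list(rho0), {a for (_, a) in pi}
    rho, total = dict(rho0), 0.0
    for t in range(horizon):                          # truncation of the infinite sum
        total += gamma**t * sum(rho[s] * pi[(s, a)] * p.get((s, a, s2), 0.0) * r.get((s, a, s2), 0.0)
                                for s in states for a in actions for s2 in states)
        rho = next_rho(rho, pi, p)
    return total

# single-state bandit check: V = [pi0*(2*p0 - 1) + (1 - pi0)*(2*p1 - 1)] / (1 - gamma) = 0.8
p0, p1, pi0, gamma = 0.4, 0.6, 0.3, 0.9               # made-up numbers
print(value_of_policy({0: 1.0}, {(0, 0): pi0, (0, 1): 1 - pi0},
                      {(0, 0, 0): 1.0, (0, 1, 0): 1.0},
                      {(0, 0, 0): 2 * p0 - 1, (0, 1, 0): 2 * p1 - 1}, gamma))
```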
And then you move to the second step, and so on and so forth. It's just that we don't know in advance, we still have to prove, that the best policy, the one which maximizes this, is deterministic; in general this thing is a function of an arbitrary policy. Okay, so that's good, but not super good yet, because we still have all these cumbersome rho_t's, with their time index, in here. So what we're going to do now is define one quantity, which I call eta(s), which doesn't depend on time, and it's just the sum over t, going from 0 to infinity, of gamma^t times rho_t(s). You see, this is exactly the combination that appears in that expression. So we are going to rewrite the value as the sum over s, a and s' of eta(s) pi(a|s) p(s'|s,a) times the reward r(s,a,s'). What does eta mean? It has a very simple interpretation. You remember, I told you, gamma can be interpreted as a probability of survival after each step. So gamma^t rho_t(s) is nothing but the probability that I am in state s after t steps and I'm still alive. And then I sum over all t. So what does that mean? This object is the time that I spend in a given state s before being killed. It's the overall residence time in state s, the average residence time, before being killed. Okay, do you agree on this? Okay, now we have come up with an expression which is more compact. And what do we have to do? We have to optimize v_pi over pi. So now we're going to do this as physicists do. Well, mathematicians do it too, but they do it more accurately than we will. First thing, we have to find the stationary points of this function with respect to pi. So we want to take the derivatives of this object with respect to pi. What's the difficulty in this? Can you spot the difficulty? No? Okay, so let's start doing it. I take the derivative with respect to pi. Does the reward depend on pi? No, this is the environment giving me the rewards, so pi has nothing to do with that. Does p depend on pi? No, this is how the environment evolves; it doesn't depend on how I choose my actions. Once an action is chosen, the environment will go its way, but it doesn't depend on the probabilities that I assign. Does this factor depend on pi? I mean, it is pi. And does eta depend on pi? Yes, and that's painful, because it depends on pi in a very, very intricate way: it depends on pi through all times, through all the rho_t's, and each of those depends on pi through the evolution equation. So, do we want to take the gradient of this with respect to pi? Not sure. Yeah, I will argue that it's not the most efficient way of doing this calculation. Is there a workaround? This is actually a trick that you have to learn, okay? Because it pops up in so many different domains of physics, and you might even have used it without paying attention. So what do you do when you have such a nasty thing? What would be your ideal world? My ideal world would be the situation in which eta is independent of pi, so that I could take the derivatives with respect to one or with respect to the other. But we said, okay, eta is not independent of pi. And why is it not? Well, because it has to obey that evolution equation. And then the idea pops up: well, perhaps I can try to optimize this thing regarding eta as an independent variable, and using the evolution as a constraint. Just out of curiosity: who has ever heard of Martin-Siggia-Rose? The Martin-Siggia-Rose action? Nobody. Okay. Now you will know how it works, in a totally different setup. So if it ever pops up, you know the trick. Okay. There's a price to pay for this, right?
Because what we'll have to do is introduce this dynamics as a constraint. And if you want to introduce a constraint in an optimization problem, what do you do? Lagrange multipliers. And in order to do Lagrange multipliers, what do you have to introduce? The multiplier, which is an additional function coming into the game. So the price to pay is that you decouple the dependence of eta from pi, but you will have another additional field, or vector. Let me ask another question: who has ever done the Hubbard-Stratonovich transformation in statistical physics? Isn't that the same idea? It is the same. And there are many other things in theoretical physics that you do the same way: you decouple nasty nonlinearities by turning them into constraints with auxiliary fields. Okay. I will never be able to make it in three minutes, and at one I have to leave. So I will just outline what we will do first tomorrow, before moving to something else. We will do exactly what I said. The only thing is that, rather than using this equation for rho, we will use directly an equation for eta, that is, an equation for the average residence time with killing probability. So what is the equivalent equation for this quantity? I will write it down here. Eta obeys the following equation: eta(s') = rho_0(s') + gamma times the sum over s and a of p(s'|s,a) pi(a|s) eta(s). The average residence time, with killing or without, obeys a closed equation. And this is all we need: this is the constraint we need in order to use that formula, so we can get rid of all the rho_t's floating around. It's a recursive formula, okay. Ever seen that? Probably not in this form, but in fact you have; you just don't know it yet. Okay. So for tomorrow, if you want to do the exercise, try to prove this formula. We will start from the combination of this expression and this formula tomorrow, to derive the optimality equation and the result that we're looking for. See you tomorrow.
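(For those who want to check the closed equation for eta numerically before tomorrow, here is a sketch under my own array conventions, not the lecture's notation: in matrix form the equation reads eta = rho_0 + gamma * P_pi^T eta, with P_pi[s, s'] = sum_a pi(a|s) p(s'|s,a), so eta comes from solving one linear system, and the value then follows from the compact formula above.)

```python
import numpy as np

def discounted_occupancy(rho0, pi, p, gamma):
    """Solve eta(s') = rho0(s') + gamma * sum_{s,a} p(s'|s,a) pi(a|s) eta(s).
    pi: array (S, A) with pi[s, a] = pi(a|s);  p: array (S, A, S) with p[s, a, s']."""
    P_pi = np.einsum("sa,sat->st", pi, p)          # state-to-state transition matrix under pi
    n = len(rho0)
    return np.linalg.solve(np.eye(n) - gamma * P_pi.T, rho0)

def value_from_eta(eta, pi, p, r):
    """V_pi = sum_{s,a,s'} eta(s) pi(a|s) p(s'|s,a) r(s,a,s')."""
    return np.einsum("s,sa,sat,sat->", eta, pi, p, r)

# single-state bandit check: eta = 1 / (1 - gamma) = 10, and V = 0.08 / (1 - gamma) = 0.8
p0, p1, pi0, gamma = 0.4, 0.6, 0.3, 0.9            # made-up numbers
pi = np.array([[pi0, 1 - pi0]])                    # shape (1 state, 2 actions)
p = np.ones((1, 2, 1))                             # whatever you do, back to the same state
r = np.array([[[2 * p0 - 1], [2 * p1 - 1]]])       # average rewards, shape (1, 2, 1)
eta = discounted_occupancy(np.array([1.0]), pi, p, gamma)
print(eta, value_from_eta(eta, pi, p, r))
```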