All right. So in this part of the lecture, we start introducing the basic concepts for Markov decision processes, which requires formalizing concepts that we've already been discussing. So we start out by considering a set of discrete states, which belong to some set of cardinality S here on the right, and a set of actions, which again has a fixed cardinality. So these are discrete actions. Again, there is no fundamental difficulty in extending to continuous states and actions, in the sense that the difficulties that arise with continuous states and actions also arise when your set of discrete states is very, very large. So there's actually no need, except for technical reasons, to treat them differently. And one key ingredient in defining the Markov decision process is that the dynamics through the space of states is described by a transition probability, which is given here. And it expresses the probability that, starting from a state s and taking the action a, the system ends up in a state s'. OK? So it's useful from the beginning to introduce a graphical description of this process, which I'm putting down here. So you have a general question about Markov decisions. Are these processes memoryless because the models or the systems we are studying are memoryless, or because it's really hard to implement memory as a part of the system? OK, very good question. So in the following, we will assume for now that the system is truly memoryless in its state space. This means in general that often, for this assumption to be true, your state space has to be very, very large. OK? And this is, of course, a difficulty. You can treat your system as if it were Markovian, in the sense that you use this Markov assumption as a model of your system, even if the system itself is not Markovian. But we are not discussing this now, because this falls into the process of moving from the upper right side of our diagram down to the situation where we have limited observability, because one source of non-Markovianity is limited observability. So we will discuss this separately. Thank you for the point. So yeah, other questions? So one way to graphically represent this information is to think of your system as a set of states, which you can depict by circles, OK? So these are my states: state 1, state 2, state 3. And then from each state, I can take a set of possible actions, OK? For instance, from state 1, I could take action 1 or action 2. And if I take action 1, I might end up, for instance, with a certain probability here or with a certain probability there. And this probability would be the probability of ending up in s1 given that I started in s1 and took action a1. And here would be the probability of ending in s2 given the start in s1 and the action a1, OK? And so on and so forth. You can have several of those. And you can imagine however complicated a graph, OK, with several actions; here I'm using the same actions. And you can fill in all the quantities that you want. So in these graphs, there might be terminal states. You see, eventually, all of this process ends up in s2, which might be a terminal state, so everything dies in that state. Or there might be situations where everything is recurrent, and it moves around restlessly, OK? So it's useful to keep in mind this idea of a graph of a Markov decision process, OK? But we were listing our ingredients. So we had states and actions. We have transition probabilities.
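For readers who want to follow along in code, here is a minimal sketch of how the transition probability p(s' | s, a) of the little three-state, two-action graph from the lecture could be stored and sampled. All the numerical values are invented, purely for illustration.

```python
import numpy as np

n_states, n_actions = 3, 2

# P[s, a, s'] = probability of landing in s' after taking action a in state s.
# Numbers are made up; each row P[s, a, :] must sum to 1.
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.7, 0.3, 0.0]   # from state 0, action 0: stay with prob 0.7, reach state 1 with 0.3
P[0, 1] = [0.0, 0.5, 0.5]
P[1, 0] = [0.0, 1.0, 0.0]   # state 1 as an absorbing (terminal) state under action 0
P[1, 1] = [0.2, 0.8, 0.0]
P[2, 0] = [0.1, 0.4, 0.5]
P[2, 1] = [0.0, 0.0, 1.0]

# Sanity check: every (s, a) row is a probability distribution over s'.
assert np.allclose(P.sum(axis=2), 1.0)

# Sampling one transition: from state s, take action a, draw s' ~ p(. | s, a).
rng = np.random.default_rng(0)
s, a = 0, 0
s_next = rng.choice(n_states, p=P[s, a])
print(f"from s={s}, a={a} -> s'={s_next}")
```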
And as a matter of fact, so far, OK, we've been describing a Markov decision process, but we are still lacking a fundamental ingredient, in the sense that the idea is that we want to control this process. That is, we want to make decisions here that affect what comes after. And one way of doing this is to introduce the notion of a policy. So what is a policy? A policy is a probability distribution over actions. Actually, it's not one probability distribution, but a class of probability distributions. That is, what is a policy? From every state, you decide how much probability you want to give to the possible actions. So how much weight do you put here? Say, the weight you put on this action is the probability of picking action 1 given state 1. And the one that you put here, maybe smaller, is the probability of taking action 2 when you are in state s1. This is called the policy. And it's important to realize from the very beginning that these two parts of the problem are very different from each other: there is the policy, and there is the transition probability, which is what is encoded in these arrows. The latter is not in the control of the agent. The agent has no control over it. Once you take an action, what will happen is not in your hands anymore. It's a consequence of the way you are built as an agent, the way the robot is built, and the way the environment is. The environment can also change by itself in a way that doesn't depend on actions; this is also included. The environment can be dynamic. Whatever action you take, the environment can change, and you can have no control over that at all. On the other hand, this part here, the policy, is something over which you have control. And the goal will be to find the best policy. But the best policy with respect to what? So we still have to define the central thing in our problem, which is the goal. And in order to define the goal, we have to add something to this description. We have to add another ingredient in the form of something which is known as a reward. We have to introduce rewards. So what are rewards? Well, in the general case, we can introduce them here. So we can tweak our transition probability in order to add, to each of these steps, the probability of going to a certain location and receiving a certain reward. When you think about rewards, the easiest thing is to just think about money. So I am in a certain position s, which might be, for instance, my bank account or my set of assets. And then I take an action a, which might be to buy or sell something. And in the process, I end up in a new state: my assets have changed, my bank account has increased or decreased, depending on what I got in the process. So the action is the choice of what kind of investment you make. And the reward is the immediate gain that you get in this single step. So this notion of reward is very important. So let me put it here. Rewards. Or, more generally, they are also called reinforcement signals, because in animal and human behavior these are not necessarily rewards in an obvious sense, but something more abstract. So rewards are immediate gratifications, if you wish. And as we will see in a second, they are not necessarily goals in themselves in reinforcement learning.
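Continuing the hedged sketch above: a policy π(a | s) is just one probability distribution over actions per state, kept strictly separate from the environment's transition matrix P. The weights below are invented.

```python
import numpy as np

n_states, n_actions = 3, 2

# pi[s, a] = probability of choosing action a in state s -- this is the part
# the agent controls, unlike P, which belongs to the environment.
pi = np.array([
    [0.9, 0.1],   # in state 0, mostly pick action 0 (made-up weights)
    [0.5, 0.5],
    [0.2, 0.8],
])
assert np.allclose(pi.sum(axis=1), 1.0)

rng = np.random.default_rng(1)
s = 0
a = rng.choice(n_actions, p=pi[s])   # the agent's choice...
# ...what happens next, s' ~ p(. | s, a), is out of the agent's hands.
print(f"in state {s}, sampled action {a}")
```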
But rewards are important. So how do they enter my graph? Well, it means just that every time I make a transition, on top of this, there is a certain probability of getting some reward R. So here you get some reward R, there you get some other reward R. Every one of those is potentially different. These rewards are real numbers: a one-dimensional feedback from the system, which is telling you how well you're doing in the process. So, for instance, the rewards in our bandit problems are what you get once you pull the arm. This is the obvious and immediate interpretation. In the navigation task in the grid world, it is how much dust you collect if you move to another point. So remember the graph for the grid world. You are in a state s, which is your green dot here. So this is your initial state s. Then you take an action. Suppose that this is the action you take. And if you take that, then you end up in a new state here, which is your new state s'. And is there any dust in that tile? No, no dust. Therefore, you get reward equal to 0. And then you move around. And if your robot was here, for instance, and it took a move down, and therefore it ended up here, then in this case the reward would be, say, 1. Or if it goes to the charger, maybe the reward is 10, because it's charging. Whatever; you decide how to attribute these. So notice that in this case, the transition probabilities, the ones I described, were very simple, because we are saying that if the robot wants to go west, it will go west. But the MDP also accounts for situations in which it wanted to go west or east, but it turned out that it also goes sideways. This is also included in this kind of formalism, even if we didn't discuss it. So it's important to realize that actions have to be interpreted as intentions. This is what the agent wants to do. But s' is the actual outcome. So the agent in a given state has the intention of moving to another state, for instance, which may or may not result in actually doing that. This is the distinction between the intention and the outcome. Apart from all this lexicon, which is at the same time useful and sometimes confusing, the mathematical definitions are crystal clear. So this is basically the last ingredient we need, because once we have the rewards, we can define our goal. And the goal is, I will write it explicitly, and then we discuss it together: the goal is to maximize the cumulative reward in the future, over policies. So when I say maximize, I always have to specify over which kind of variables I maximize. And here the parameters that I can use to maximize are the policies. And the second important distinction is that I want to maximize the cumulative reward. So I don't want to maximize what I get immediately, not necessarily. I might be interested in some long-term goal. Of course, if you're interested in some long-term goal, that's where planning comes into the game. So if you have a navigation task and you say, OK, my goal is to go from Trieste to Udine, then the route that you want to take is the one that tries to optimize over these possible routes. But if you have a shorter horizon and say you care only about the distance you cover in five minutes, then maybe you take another route, which is faster but doesn't take you exactly where you wanted in the first place. So it's important to distinguish between these short-term and long-term goals, because the strategies will be different.
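To make the intention-versus-outcome distinction concrete, here is a hedged sketch of a "slippery" grid-world step: the agent intends to move west, but with some probability it slips sideways, and the reward depends on where it actually lands. Grid layout, slip probability, and reward values are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

MOVES = {"north": (-1, 0), "south": (1, 0), "west": (0, -1), "east": (0, 1)}
PERPENDICULAR = {"west": ["north", "south"], "east": ["north", "south"],
                 "north": ["west", "east"], "south": ["west", "east"]}

def step(pos, intended, dust, slip=0.2, size=5):
    """One transition: 'intended' is what the agent wants; the outcome is sampled."""
    if rng.random() < slip:                       # slip sideways with prob 0.2
        actual = rng.choice(PERPENDICULAR[intended])
    else:
        actual = intended
    dr, dc = MOVES[actual]
    r = min(max(pos[0] + dr, 0), size - 1)        # stay inside the grid
    c = min(max(pos[1] + dc, 0), size - 1)
    reward = dust.pop((r, c), 0.0)                # collect the dust (reward), if any
    return (r, c), reward

dust = {(2, 1): 1.0, (4, 4): 10.0}                # made-up dust/charger rewards
pos, reward = step((2, 2), "west", dust)
print(pos, reward)
```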
If you have a short-term goal, you will do things that are different from what you would do if you had a long-term goal. I don't need to explain that to you, because you are here. And if you are here, it's clearly because you are not optimizing over a short-term goal, because otherwise you'd be sitting in the sun. You are optimizing over some kind of long-term goal, which might be different for each of you, but it's certainly long-term, OK? It's certainly longer than these two hours. OK. So what does it actually mean to maximize the cumulative reward? We have to make it precise. And to this aim, we introduce an object which is called the return, which is in fact an expected return. So its definition is that it's the expectation over a sum which extends into the future. So starting from any given state, you look into your future, OK? So bear with me. If you are still asking, how can I possibly do that, we will answer this. But you look into the future, into what will happen, and you want to maximize this object. Let me write it like this, and then I will explain what this is. So remember what is happening in this process. Every time that you take an action according to a given policy, a given strategy, a distribution over actions, you collect a reward. And then you repeat that again, and maybe you take the same action or another one, and you collect a different reward. And you move around this state space by picking actions and collecting rewards. This object here is the sum of all the rewards along the way. This is one possible form of it, OK? This is the discounted form: the expectation of R1 + γR2 + γ²R3 + ⋯. I will explain what this is in a second. Or there is the finite-horizon form, which is to maximize the sum, for t from 0 up to a certain time T, of your rewards along the way, without discounting. So this expression below is probably the most intuitive one. You decide: my time horizon is capital T steps. I want to optimize over a time which is 10 days, 10 years, 10 minutes. I decide this beforehand, and this is going to be my horizon. And then I ask, what is the best policy in order to achieve this result? Now, as you can imagine, one important thing is that if you set yourself a fixed horizon, your strategy will change over time, OK? So suppose that you say, I have to turn in my homework in 10 days. Then what will you do? You will do nothing on day 1. You will do nothing on day 2. You will do nothing on day 3. And then on days 8 and 9, you will do everything, OK? So typically, if you have a finite horizon, the best strategy is actually to do something different over time. On the other hand, when you use the discounted version, which is the one on top, we will see that you have strategies which are stationary over time. So the best strategy is the same irrespective of when you start. So what is this discounted thing that we have here? This parameter gamma is called the discount factor. It is typically between 0 and 1, with equality allowed under certain conditions; but for now let me state it like this. It expresses a very simple idea, which is the same idea as the time horizon. So, for instance, let's consider the situation where gamma is 0. If gamma is 0, there is only one term in the sum, OK? So this quantity here, which we often call G, this sum, becomes just the first term. So you're only interested in R1 in this case. So this is the situation where you are very myopic and you focus on immediate gratification.
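Here is a hedged sketch of the two forms of return just described, applied to an arbitrary made-up reward sequence: the discounted sum with factor gamma, and the plain finite-horizon sum up to T.

```python
import numpy as np

rewards = np.array([0.0, 0.0, 1.0, 0.0, 10.0, 0.0, 1.0])  # made-up reward sequence

def discounted_return(rewards, gamma):
    """G = sum_t gamma^t * r_t  (one sampled trajectory, not yet an expectation)."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

def finite_horizon_return(rewards, T):
    """G = sum_{t=0}^{T-1} r_t, with no discounting."""
    return rewards[:T].sum()

print(discounted_return(rewards, gamma=0.0))   # myopic: only the first reward counts
print(discounted_return(rewards, gamma=0.9))   # far-sighted: later rewards still matter
print(finite_horizon_return(rewards, T=3))     # hard deadline after 3 steps
```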
On the contrary, when gamma tends to 1, you're taking a sum which has a very, very long contribution from times that are far in the future. So this is the situation where you are looking far ahead into the future. And this is what makes it a long-term goal. Notice that in many situations, like navigation, the goals are long-term, because in order to reach a point, you have to go through a desert of gratification, OK? There will be nothing for you for ages until you get to the final reward. So planning is always something which requires long-term goals rather than immediate gratifications. OK? So this discount factor is something you can easily interpret by saying: if I get a reward tomorrow, it's going to be more valuable than if I get the same reward the day after tomorrow. And how close this gamma is to 1 says basically how fast these weights go towards 0. Basically, everything that happens at times t much larger than 1/(1 − γ) doesn't count. OK? So this series is exponentially cut off at times larger than that. So this can also be called the horizon. It plays a role similar to this capital T, only in a different way: it's a sort of smoothed version of a hard deadline. OK? So clearly this object that we want to optimize depends on the policy, because the choice of the actions affects the rewards we collect. And therefore, it makes sense to ask how to maximize this thing with respect to policies. So formally, this expectation depends on the P's and on the π's. But remember that we can only act on the π's. So that's the only set of parameters over which we can optimize. OK? So this is basically the formal setting for Markov decision processes. The next question in line is: how do we solve them? So you, as a planner, as a decider, are given all this information: the transition probabilities; you can choose your policies; and one of these two is your objective. How do you compute the optimal policy? What are the properties of this policy? What does it look like? Is it random? Is it deterministic? All of these questions we will address in the next lecture, starting to set up the machinery that will lead us to introduce the notion of optimality equations and how to use them for planning. OK? So that's going to be it for today. If you have questions, please come forward. I have one. Yes, please. I'll open the camera. I've seen, basically everywhere, that gamma has to be lower than 1, which makes sense when the horizon can be infinite. That makes sense. But in practice, I've seen that when you have a time horizon, say a game that ends after a certain number of steps, in many applications gamma is actually set to 1. But in the literature, at least myself, I never found anyone explicitly saying that gamma can be equal to 1, which, again, would make sense in this case, because if we're playing a game, the reward is whether we're winning or losing, and we just care about the eventual reward. OK. Gamma can definitely be 1 in two situations. One is this, the one that I'm writing here below. You see, I could have added the gamma here in the finite-horizon case as well. And sometimes people do; they combine the two. There is no particular restriction. What is important is that as long as you put a hard horizon, your optimal solution will generically be time dependent. OK.
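A quick numerical illustration of the "smoothed horizon" idea: the weights gamma^t decay exponentially, so contributions beyond roughly 1/(1 − gamma) steps become negligible. The values of gamma below are arbitrary.

```python
import numpy as np

for gamma in [0.5, 0.9, 0.99]:                 # arbitrary example values
    horizon = 1.0 / (1.0 - gamma)              # effective horizon 1/(1-gamma)
    t = np.arange(0, int(5 * horizon))
    weights = gamma**t
    # Fraction of the total weight sum_t gamma^t = 1/(1-gamma) that lies
    # within the first 'horizon' steps:
    within = weights[: int(horizon)].sum() * (1.0 - gamma)
    print(f"gamma={gamma}: horizon ~ {horizon:.0f}, "
          f"weight within horizon = {within:.2f}")
```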
And if you seek stationary policies, they can only be approximately optimal. So that's one formal thing to say. Another thing is that you can actually put gamma to 1 in the situation above, where your sum extends infinitely into the future, if your Markov process has some absorbing state, like in this case here. If there's some absorbing state, you can put gamma equal to 1, because with probability 1 your system will end there. So as long as, once you're there, you get no further reward, so all the transitions there give no additional reward, then you can safely set gamma to 1. Which means that if you don't know the MDP, then you cannot do it; you must know the MDP. Exactly, yeah. OK. And I should also add that there is a way to take the limit of gamma tending to 1 in order to describe processes which are stationary. So it makes sense also to think about gamma not being set to 1, but approaching 1 in a limit. All these things are, I think, very well described in Sutton and Barto's book, where they describe all the possible combinations of these discount factors. These two approaches stand out: the finite-horizon one for theoretical simplicity, because it will be the key for us to do what is called dynamic programming, that is, starting from the end of the episode and building up the optimal solution by going backwards in time, which is one important key. And the discounted one is important because this particular form of time discounting, by means of these geometric factors, is the key to writing down an optimality equation that doesn't depend on time itself. So everything will be invariant under time translations in that setting. Because of these very nice properties, we will insist on these two. But of course, you can mix up the two; you can do approximate stuff, that's for sure. OK, thanks. Sure. Any other questions? So if not. A question. Yes, please. So I have two questions. The first is, I don't know if it makes sense to see this average as something like a present value, as in discounting. So this average has to be thought of in the following way. Operationally, how would you compute this? Suppose you find yourself in state s1, and then you start doing things according to a certain policy. So you choose a policy; that is, you choose the probability distribution for the things you want to do next, the actions. And then you start doing a Monte Carlo, in the sense that you pick an action according to this probability and you observe the new state. And then you pick another action, then you observe a new state. So graphically, what you do is... this gives me the opportunity to introduce another graphical description. If you are in a certain state S at time 0, you use your policy to extract an action at time 0. And then these two together, through your P, will send you to a new state S1. And in the process, you will observe a reward R1. And then you repeat. So this is how you write this process in the form of a temporal graph. So suppose you start here, and then you just do a Monte Carlo: you simulate, you see what you observe, and you collect your data. You sum, and then you repeat again and again and again. And this average of Monte Carlo samples will converge to this expectation. The point is that this procedure of doing things by Monte Carlo is very expensive.
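Here is a hedged sketch of the Monte Carlo procedure just described, with an invented environment of the same shape as the earlier toy sketches: roll out many trajectories under the policy, sum the discounted rewards of each, and average. The sample mean converges to the expected return.

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions = 3, 2

# Toy environment and policy (random, made-up numbers).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # p(s' | s, a)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)              # pi(a | s)
R = rng.normal(size=(n_states, n_actions, n_states))               # invented r(s, a, s')

def rollout_return(s, gamma=0.9, n_steps=200):
    """Sample one trajectory from state s and sum its discounted rewards."""
    g, discount = 0.0, 1.0
    for _ in range(n_steps):                     # truncate: gamma^200 is negligible
        a = rng.choice(n_actions, p=pi[s])
        s_next = rng.choice(n_states, p=P[s, a])
        g += discount * R[s, a, s_next]
        discount *= gamma
        s = s_next
    return g

# Monte Carlo average over many rollouts: converges to the expected return.
samples = [rollout_return(s=0) for _ in range(2000)]
print(np.mean(samples), "+/-", np.std(samples) / np.sqrt(len(samples)))
```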
So you actually want to know what happens in the future without explicitly simulating it. How is this possible? Well, you will have to wait for tomorrow. You had another question? I just want to understand better why we have the discount factor. Aside from the fact that you mentioned that it gives the nice property that what you have is time invariant, is that the only reason? OK, so there are two reasons. One broad reason is that it introduces one way of thinking about how far into the future you are interested in. So this is the first thing. And second, in general, you could have added different ways of discounting. This function gamma to the power t is one particular choice which has this nice property, which is actually a consequence, if you want, of the fact that these geometric factors are the equivalent of the exponential distribution. And the exponential distribution is memoryless. That's why this object doesn't depend on time. So this is the technical reason. It's important to note that psychological experiments with humans have shown that humans do not use this kind of discounting. They discount in a much slower way, with a sort of hyperbolic rather than exponential decay, which is very interesting from the viewpoint of psychology and neuroscience. So one could be interested in changing this kind of discount factor in the future, and there's a lot of work at the interface between decision making and neuroscience on that. But the calculations become cumbersome, and for the purpose of clarity of explanation, I will stick to these two settings. OK. Thank you. I have another small question about the rewards. So for now, what we say is that the reward is simply a real number, and it is associated with... because, without knowing better, I would have said that we think of it as a function. But for now, we just say it's a number associated with couples of states. I think I get what you're aiming at. So let me try to anticipate you in answering this. I've been saying that there is a probability distribution which generates rewards and new states. So actually, this is a probability density with respect to R; but allow me to anticipate what we will do tomorrow. We will introduce an object which is the expected value of R given S and A: the sum over s' of the integral over all possible values r of r times p(s', r | s, a), which we will also define, with a slight abuse of notation, as R(s, a). Actually, just a second, I don't want to mess up the notation now. So let's put this aside and then let's use this, OK? So these R's in general are random variables, which can be experienced every time you move from a state to another state through an action. But in the process of optimization, since you are interested only in averages here, it's useful to introduce this conditional average. And we will call this the average reward. So basically, the point is that given a starting state, an action, and an ending state, even if we know all of them, in theory the reward could still be different. But then we take an average of this. Exactly. So the distribution of rewards may vary; it may be stochastic. But in solving the planning problem, we will only be interested in this object.
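A hedged sketch of the conditional average just written down: given a joint table p(s', r | s, a) over a finite set of possible reward values (everything below is invented), the expected reward r(s, a) is a plain weighted sum over s' and r.

```python
import numpy as np

rng = np.random.default_rng(4)
n_states, n_actions = 3, 2
reward_values = np.array([-1.0, 0.0, 1.0, 10.0])   # finite set of possible rewards (made up)

# joint[s, a, s', k] = p(s', r_k | s, a), normalized jointly over (s', k).
joint = rng.random((n_states, n_actions, n_states, len(reward_values)))
joint /= joint.sum(axis=(2, 3), keepdims=True)

# r(s, a) = sum_{s'} sum_k r_k * p(s', r_k | s, a)  -- the "slight abuse of
# notation" R(s, a) from the lecture.
r_sa = np.einsum("ijkl,l->ij", joint, reward_values)
print(r_sa)                                        # one expected reward per (s, a)
```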
And so, in this sense, this is a function, because it's a function of the triplet s, a, s' that you're visiting along the way. All right. Otherwise, we should think of it as a distribution, basically. Yeah, exactly. And that full distribution is something you don't need to know. OK. OK, thanks. You had a question about the rewards? Yes, please. So maybe we can also think of the reward as its opposite: not something we want to maximize, but something we want to minimize. So for example, if there is an error, and we want to minimize it. Sure, sure. I think I got it. So the basic idea is that in the decision-making context we tend to think in terms of rewards which are positive, but they could be negative. A reward which is negative is, loosely speaking, a punishment, even though it's a bit of a stretch of the definition, because in animal behavior, positive and negative rewards are actually perceived differently. But nevertheless, you can also think in terms of what you're saying: if you reverse rewards, then you have costs. OK. So in everything that follows, it's sufficient to put a minus sign in order to switch from rewards to costs. You can formulate all these things in terms of costs: every time I do something, I get a penalty, and then I want to optimize in order to reduce this penalty, which might be an error or whatever. Is that what you're asking? So it's absolutely symmetric in this sense. Even though, like I told you, there are situations in which there is a difference between plus and minus signs, but this is something more about neuroscience than the mathematics of decision. And can you give an overview, or tips, on how to choose the correct gamma between 0 and 1? In a real-life problem, you mean. Choosing the gamma depends very much on what kind of objective you have, right? So, for instance, let's go to the navigation task. Let me redraw this, because otherwise it's a mess; I keep on drawing on the same graph. So let's go back to the navigation task. I'm not going to write down the grid. I'm just writing down the fact that we have some initial state S. And the actions are always something very localized in space, so they can only bring me to a small neighborhood of S. And then I have certain rewards which are spread around. So, for instance, here there is a reward, here there is a reward, and here there is a big reward, OK? Like I told you, the number 1/(1 − γ) is essentially the horizon, OK? So you can expect that if you set gamma, for instance, equal to one half, the only thing you care about on average is what happens over two time steps. So the typical time you care about is 2. Now, if the typical time you care about is 2, then what is your optimal strategy for moving around? Well, you will definitely go for the closest reward. So your whole policy here would be: go towards that. If you are close to this one, go towards the closest one. And if you're there, go towards here. But now suppose your horizon is longer, say you set gamma of order 0.9; then your horizon becomes about 10, OK? So you have about 10 time steps ahead of you to look forward. In this case, maybe your strategy would be different, because if you start here, maybe what you would do is say, OK, why don't I collect this one first? I collect this first, and then I go here. So I take two in a row.
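To see in code how the horizon changes the preferred target, here is a hedged sketch comparing two single-reward plans on the navigation picture: a small reward a few steps away versus a big one farther away, with all distances and values invented. The discounted value of reaching a reward r after t steps is gamma^t * r.

```python
# Two candidate targets (made-up numbers): a small reward close by,
# and a big reward farther away.
near_r, near_t = 1.0, 2      # reward 1 at distance 2
far_r, far_t = 10.0, 8       # reward 10 at distance 8

for gamma in [0.5, 0.9]:     # short vs long effective horizon
    v_near = gamma**near_t * near_r
    v_far = gamma**far_t * far_r
    best = "near" if v_near > v_far else "far"
    print(f"gamma={gamma}: near={v_near:.3f}, far={v_far:.3f} -> go {best}")
```

With these numbers, gamma = 0.5 prefers the nearby reward while gamma = 0.9 prefers the big distant one, mirroring the lecture's point that a short horizon settles for immediate gains.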
And if I'm here, maybe I still have time to do this, if I have 10 time steps, rather than going directly to that one. Or maybe I do the opposite: since this is the biggest one and its value is decreasing in time, maybe I go this way first. So in one of our practical tutorials, we will see exactly this: how the choice of the horizon affects the kind of strategies for moving around in a navigation problem. Typically, if the horizon is short, you will settle for the closest thing you can get immediately. If your horizon is longer, you can develop more complex strategies and try to move around to collect stuff, depending on how valuable it is and how much it will decay. You can imagine that there is some critical gamma, right? Because suppose that you have two rewards, one small and one large. Since both rewards tend to decay in time, you might want to go for one before the other, depending on their relative value and the value of gamma. So depending on your time scale, you might decide: well, I pick this one, or I pick this one first and then maybe the second, depending on how far I can go. You get the general idea? So it is like an arbitrary parameter. It is a parameter which you decide at the beginning. OK. And it's part of the model of your environment. It's a decision: it decides what shape your goal has. It's not something that you change over time with the algorithm. It's fixed at the outset, and there it stays. So the choice of gamma only affects the accuracy? No, no, it affects the strategy. I mean, it's very important. It affects what kind of solutions you find. OK. Sorry. Could it be, thinking about this example... so basically, based on the gamma, the agent will decide where to steer, or towards which reward to go first, and so on. But this is always probabilistic in some sense; it's not deterministic. So could it be that, say, we have an agent with a fixed strategy, he has made up his strategy, and then we are evil manipulators of the environment, and we tweak the rewards on this graph so that the agent is pushed to go towards the bigger reward but doesn't have enough time to reach it? Like, could it be possible, by tweaking the rewards in the environment, to break the agent's strategy? Because maybe the value of the reward is so high that it's keen to go towards it even if the time is not sufficient to reach it. So the expected reward is suggesting to go that way, but in practice it won't get there. OK. So let me first clarify whether I understand correctly. When you say... I mean, you are designing a certain structure of the rewards, but you're not changing it over time, right? You're just keeping it as it is. Yeah, yeah, let's say that we design it so that it tricks the agent. The reason I'm asking is because what you were saying looked a little bit like an adversarial situation, in which you are acting against the agent. And since there is a whole part of reinforcement learning which deals with games and adversarial settings, I just wanted to be clear that you're not thinking about that. Yeah, no, no, it's not dynamic. We're designing some complex, tricky environment. Is that what you're saying? Yeah, we can choose the environment at the beginning, but then we let the agent go. But we know, for example, the time that it's able to stay alive. Good.
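The "critical gamma" just mentioned can be made explicit: with one small near reward and one big far reward, the preference flips where gamma^{t1} · r1 = gamma^{t2} · r2, i.e. at gamma* = (r1/r2)^{1/(t2 − t1)}. A hedged sketch with the same invented numbers as before:

```python
near_r, near_t = 1.0, 2
far_r, far_t = 10.0, 8

# Solve gamma^near_t * near_r = gamma^far_t * far_r for gamma:
gamma_crit = (near_r / far_r) ** (1.0 / (far_t - near_t))
print(f"critical gamma ~ {gamma_crit:.3f}")   # below it: go near; above it: go far
# With these made-up numbers, gamma_crit = 0.1**(1/6), roughly 0.681.
```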
So the fundamental result is that, whatever design of the environment you choose, there are algorithms that allow you to compute the optimal solution for that environment, given the gammas, et cetera. So even if the agent falls short, that's the best thing it could have done. You are assured of that. Some strategies might seem counter-intuitive, but the mathematics doesn't lie: these are the best ones you can get if you solve the problem correctly. OK, so basically, stochastically, the answer is no; the agent will still do the most reasonable thing. Exactly. Even though there is a certain probability, which might be small, of missing the target, that's the best thing you can do on average. Another thing is if you change the game, I mean, if you don't ask about winning on average, but you ask other questions, like: I want to get a specific high reward with a given probability, OK? Then you're discussing the situation which is called risk-sensitive learning, which is very interesting in itself. It also has a very well-developed mathematical framework, but we are not discussing it now. No, it was just out of curiosity. Sure, that's a good question. Thanks. OK, I think we've really run very late today. I'm both happy and sorry for this: happy because you're interested, sorry because I overran our slot. But let's call it a day, and we meet again tomorrow, OK? Have a nice day. Thank you, bye. Thank you, bye. Thank you, goodbye.