OK, welcome back everybody. So for today's lecture, we are going to deal with different settings of the Markov decision problem. The subject is still tightly connected with the one we approached last lecture, with one difference: we now have a somewhat more approachable problem, because we are going to do away with the time dependence of the optimal solutions. But one step at a time, so let me start with a quick recap of what we did last lecture, so you can get back up to speed immediately. Last time we introduced formally the notion of Markov decision processes — let me scroll through the slides. As you may remember, the structure is made of states, actions, and rewards. Then there is some transition probability, which means that we are explicitly including the possibility that everything is stochastic in the dynamics of our system. And we have access to a control over the system in the form of a policy, which is a mapping of states into actions. This is a Markov process, which is... sorry, I see something in the chat. Yeah, the lesson froze for like five, six seconds. OK, so you've not been hearing me for five, six seconds, is that right? Yeah, the last thing you said was the mapping from states to actions, at least as far as I heard. OK, good. So we can resume from that; there's no need to restart all over again. For a given policy, this is a Markov process in itself, in the sense that when we start from a state, we can use our policy to generate an action, and this action together with that state will produce a new state. In the process, the system will also issue a reward signal R. This reward signal is important because it contributes to the objective function, which in the case we discussed is a finite-horizon objective: we are accumulating rewards over the time horizon, which in this case is capital T. All right. We discussed a simple example, the recycling robot, but the first important thing we derived is the very important concept of a value function, which is nothing but the expected future gain from a given time, small t, given a certain state we are in. So if we sit on a state s at time t, the value is how much we expect to gain under the policy pi — where pi is actually a sequence of strategies over all possible time steps. This concept of a value function is quite important because we can use it, first, to derive a recursion relationship, which maps the value of all possible states at any given time to the value of the same states at the successive time. This is the recursion relation we derived explicitly, and by itself it suggests that there is some sort of backward iteration to do, right? If you have a given policy — if you decide from the beginning that you will be following a certain mapping of states into actions at each time, say at time 0 I will use this strategy, at time 1 I will use this other one, and so on up to time capital T minus 1 — then you can unroll the values, starting from the end and linearly solving this equation. You start from the final numerical value of the value function, which is 0 at the final time, because there's no future.
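For reference, here is the recursion just described, in my reconstruction from the verbal description (I'm assuming a deterministic per-step policy pi_t and rewards r(s, a, s'); the notation follows last lecture's slides only approximately):

```latex
V^{\pi}_t(s) \;=\; \sum_{s'} p\big(s' \mid s, \pi_t(s)\big)\,\Big[ r\big(s, \pi_t(s), s'\big) + V^{\pi}_{t+1}(s') \Big],
\qquad V^{\pi}_T(s) = 0 .
```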
So every state is worth 0 at the final time. And then you step back: from the final time you go linearly to the previous time, and then to the previous time, and so on. So for any assignment of the policy, you can solve for the value. This is good and interesting, but it's not enough, because you would like to identify the best policy — the best sequence of decisions over time that guarantees you the maximum return, which is the objective you want to realize. It takes a little bit of manipulation to go from this recursion relation — not so daunting, as I showed you last lecture — to arrive at this new equation, which is ugly and cut into two different lines. And here I realize that there is a missing parenthesis, so I'm going to put it here — just close it here. This equation maps optimal value functions at subsequent times into optimal value functions at preceding times. The only thing that has changed with respect to the recursion relation is that this relationship is now nonlinear, because of the max function that you see here in the middle. But that doesn't cause any problem, in the sense that you can start from the end with your optimal value function — which would be 0 anyway, because any decision you take at the end doesn't really matter, so the optimal value function is 0 at time capital T — and then you use this equation to go backward in time, in a process which is called dynamic programming. So this is how we outlined the discussion last time. Eventually, what you come up with is a set of optimal actions to take for each state at each time, which is excellent, because essentially this means we have found a way to solve our problem. And if we are interested only in optimal decisions, in the process we can erase all the information about values, so the result can also be quite compressed in terms of memory requirements. Nevertheless, in terms of actual computation this is pretty heavy, especially when the number of states and actions becomes very large, and also because of time: with a long horizon, large state spaces, and large action choices, this problem soon becomes intractable from the viewpoint of the computational power required to solve it. But it's very important nevertheless, because it's a cornerstone of all the approaches that we will use in the following (a small numerical sketch of this backward pass follows at the end of this recap). I think the sum here is on s prime, no? The sum here is on s prime, yes. It's what I meant to write here, but it's probably not clear enough, so let me write it more explicitly. Is that what you meant? Yeah, that's correct. All right. Then at the end, we discussed one relevant problem, the traveling salesman problem, and I told you that it can be cast in this framework and solved exactly by dynamic programming. I will not dwell on this because we will have our tutorial session on Friday, April 9th, and it will be devoted in part to the solution of the traveling salesman problem. You will be guided through the theoretical considerations — what works and what doesn't — and there will also be a Python notebook that implements the solution of the problem with reasonable efficiency. This will be one part of the lecture on April 9th, so you may want to save the date on your calendars. Okay, very good.
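Since the optimal-value equation is cut across two lines on the slide, here is my reconstruction of it in one piece, with the parenthesis closed and the sum over s prime made explicit:

```latex
V^{*}_t(s) \;=\; \max_{a} \sum_{s'} p(s' \mid s, a)\,\Big[ r(s, a, s') + V^{*}_{t+1}(s') \Big],
\qquad V^{*}_T(s) = 0 .
```

And to make the backward iteration concrete, a minimal Python sketch of finite-horizon dynamic programming. The arrays P and R are hypothetical placeholders for whatever model you have, indexed as P[s, a, s2] and R[s, a, s2]:

```python
import numpy as np

def backward_induction(P, R, T):
    """Finite-horizon dynamic programming (backward induction).

    P[s, a, s2] : transition probability p(s2 | s, a)
    R[s, a, s2] : reward r(s, a, s2)
    T           : horizon (number of time steps)
    Returns optimal values V[t, s] and optimal actions pi[t, s].
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros((T + 1, n_states))        # V[T] = 0: no future at the horizon
    pi = np.zeros((T, n_states), dtype=int)
    for t in range(T - 1, -1, -1):         # unroll backward in time
        # Q[s, a] = sum over s2 of P[s, a, s2] * (R[s, a, s2] + V[t+1, s2])
        Q = (P * (R + V[t + 1])).sum(axis=2)
        V[t] = Q.max(axis=1)
        pi[t] = Q.argmax(axis=1)
    return V, pi
```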
So this was a quick overview of what we did last time. For today, I would like to start with something very, very light: another exercise inspired by dynamic programming, which I call the procrastinator. The motivation for this little example of a Markov decision process is the realization that, at some point, you find yourself the day before the deadline, and you curse yourself and say: why do I always end up doing things at the last moment? Then after a while, you realize that you're not the only person in the world who suffers from this apparently suboptimal behavior. So you try to understand the psychology of it, and you might ask yourself: maybe it's not just me who is wrong; maybe there's something about the way we all reason which is not entirely adequate to the task we have to do. That's one possibility, and there's a lot of psychology about it. The other possibility is that you fully embrace your limitations and say: no, maybe the reason is another one — maybe there's something optimal in what I'm doing, and we are all doing the same thing because we are optimizing over something. That's another viewpoint, and it's the one I take, because I unapologetically procrastinate. So I cooked up a little exercise that tries to explore when procrastination is the right way to behave and when it isn't. It's useful as an exercise in solving Bellman's equation, and it's useful as an exercise in reflecting on when some decisions are good, when they are bad, and how this depends on the model of the environment. The exercise goes as follows — it's very simple. There are two states, and these two states are 'done' and 'not yet'. At time zero, you're given an assignment. At this very initial time, two things may happen, for generality. Typically, if you are given an assignment, you have not done it yet, so you start in the state 'not yet' — the state below. But maybe, because you're lucky, it happens that you have already done it; or maybe your teacher is lazy and you can find the solution of your exercise just by Googling it. Then on day zero you are already done, and you're happy, and nothing happens. So in general, your initial state will be a probability distribution over these two states — which is the rho-naught of our previous lectures: some distribution over the initial states. Then, what can you do? If you're done, well, you're done; there's nothing left to be done. You go back to the state 'done' with probability one and with reward zero. Here, of course, there's a lot of leeway in deciding what the parameters are, et cetera; I will keep it very simple, just for the understanding of the problem. And this goes on until the horizon, which for us will be capital T, some time in the future. So if we start today and the assignment is due in 14 days, then capital T is 14 and every time step is one day. Okay, so what happens if you are in the 'not yet' situation? Well, in the 'not yet' situation, you have two options.
The first option is to take the action 'procrastinate'. If you procrastinate, this takes you back to the same state the day after — with probability one, I hope — and with reward zero. So this just says: one day has passed. But if I take the action 'get it done' — I say, okay, I'm going to spend the afternoon doing this assignment — then two things can happen in our model. You may actually complete the assignment, because you're quiet and in a good mood and everything goes fine: you get it done with probability one minus epsilon, and what this epsilon is, I will tell you in a second. And since doing homework takes some effort, you have to pay a price, which is this small c, the cost of working — the effort you put in. c is positive, so it's a negative reward. What's the good point? Once you're there, you're sure that you're okay, and you will stay there for all the remaining time. But sometimes it can happen that, no matter how hard you try, you can't get your homework done: the connection is down, or your friend calls you and says, I have tickets for the concert or for the cinema, and then you say, you know what, screw the exercise, I will do it tomorrow. There is some chance epsilon that this happens. Of course, if it happens you don't get it done, but you also don't pay any price, because you have not put the effort in. So where's the trick? The trick is at the end of the game, when the horizon is reached. At time capital T, if you are done, you get zero reward: you've done your job, everything is as it should be — no reward, no penalty, you're okay. But if you are in 'not yet', then you get a big fat minus capital C. You get punished for not having completed your assignment in time. Clear enough? I'm sorry, is it the same C? No, it's another C — all right, let's call it C prime; it was capital C, but C prime is clearer, I think. The penalty C prime is always larger than the cost c of the effort; otherwise there would be no conflict, and you have to create a situation with conflicting requirements in order to have a non-trivial decision-making problem. So, the questions: What is the best strategy? How do you obtain it? How does it depend on the parameters — notably the two costs and the probability of being distracted? You can set up the recursive Bellman equation from the final time and then go backwards. It froze again for another five seconds, when you... I will repeat: the question for this exercise is to find the best strategy and to understand how it depends on the parameters. You may want to do this analytically or solve it numerically. The advantage of doing it analytically is that you have immediate control over the parameters and can answer all your questions about them immediately; if you do it numerically, you will have to scan over different parameter values. Either way, the basic idea is that you can come up with a sort of phase-space diagram of where some decisions are best and where others are not. Clear enough? Okay.
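To get you started on the numerical route, here is a minimal sketch of the backward recursion for this exercise, under my reading of the model above (the names eps, c and C_prime for epsilon, the effort cost, and the final penalty are mine):

```python
def solve_procrastinator(T, eps, c, C_prime):
    """Backward induction for the procrastinator MDP.

    States: 'done' and 'not yet'. In 'not yet' you can procrastinate
    (stay put, reward 0) or work: with probability 1 - eps you finish,
    paying cost c; with probability eps you get distracted, stay put,
    and pay nothing. Terminal reward: 0 if done, -C_prime otherwise.
    Returns the optimal action in 'not yet' for each day t = 0 .. T-1.
    """
    V_done, V_not = 0.0, -C_prime        # values at the horizon t = T
    policy = []
    for t in range(T - 1, -1, -1):       # go backward from the deadline
        q_procr = V_not                  # a day passes, nothing changes
        q_work = (1 - eps) * (-c + V_done) + eps * V_not
        policy.append("work" if q_work > q_procr else "procrastinate")
        V_not = max(q_procr, q_work)     # 'done' keeps value 0 throughout
    return policy[::-1]                  # reorder from day 0 to day T-1

# A quick scan over the distraction probability, toward the phase
# diagram mentioned above (parameter values are arbitrary):
for eps in (0.1, 0.5, 0.9):
    print(eps, solve_procrastinator(T=14, eps=eps, c=1.0, C_prime=10.0))
```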
Like I said, these exercises are not required; you don't have to turn any of them in to me for correction. But if you want to do one and then ask for advice or for discussion, I'm, of course, available. And if you get very, very interested and passionate about this, and you want to elaborate on this exercise or on others that come to your mind, these could be suitable exercises for the final project, for instance. So be proactive in trying to find out what you find most interesting; it could be the basis of the final project. Yes, please. So the parameters that we want to estimate in this case — is it only epsilon, which is the only thing we have control over? Well, no: epsilon we don't have control over; it's a property of the environment, because it's in the transition probabilities. What you have control over is the policy: you decide at each time whether you want to procrastinate or to get it done. But the point is that you might also think of these parameters not as properties of the environment, but as the model the agent uses to describe the environment — I'm flipping the interpretation a little bit. Suppose there are two agents. The environment is always the same, but one agent estimates that it will be distracted a lot, so its epsilon is large; the other agent says, I'm a very focused person, so I will be distracted little. Depending on this prospective vision of what the environment is going to serve up, they might decide to do different things. Is that clear? That means that we take into account the possibility of a high or a low epsilon, and hence we choose the best policy based on that — so the policy depends on epsilon, basically? Exactly: the policy always depends, implicitly, on the parameters of the environment — on the kind of transition matrices and rewards that you get. In some situations, the policy does not depend, or depends only very weakly, on the properties of the environment, so there are sort of general solutions of the problem. But sometimes — and this is also interesting — there are transitions in the qualitative behavior of the policies as the parameters change; some parameter changes trigger them, others less so. Okay, very good. We've been on this for quite a while — what time is it? 19, okay, good. Let me then move on, unless there are any questions. Yes, please. Can I ask a question? Sure. I didn't understand this: from the model, you said at t equal to capital T, if done, then zero reward. Yeah. But shouldn't it be either zero reward or minus C? I don't understand that. Because you are in 'done' and you get zero reward, or you are in 'not yet' and you get minus C for doing it. No — if it's the final time, there's no more time to do it. It's the day when you have to turn in the assignment. So if you happen to be in 'not yet', you get this minus C prime, which is much bigger. Okay, okay — capital C prime. Sorry, I understood now; thank you. Very good. All right, so let's move on, because today I want to discuss with you a large class of decision-making problems where this notion of time has to be treated differently, in the sense that there is no real clock ticking in our problems: there is no need to consider different strategies depending on how far we are from the deadline, from the horizon of our problem. These problems are quite important and interesting, so they deserve to be treated separately.
So before the break, what I want to do is start out with some heuristics: some non-mathematical reasoning about what happens in certain situations and why they can be treated with time-stationary policies, basically. Then in the second half, we will go through the math, and we will understand how to write Bellman's equation in these new situations and how to solve it — which requires a different technique from dynamic programming. Okay, to start with, let's consider our first example, the situation where there are terminal states. What is a terminal state? Well, by definition — let me check this, so I don't mix up the definitions — yes, a terminal state is a state s such that, whatever action I take, if I'm in state s, the probability of ending up in state s prime is one if s prime is equal to s, and zero otherwise. So it's a state from which you do not exit. These states are also called coffin states — the coffin being the wooden box in which you end up when you're done. As unpleasant as it is, this is pretty much an accurate description of what happens in a terminal state. In an MDP, terminal states must have an additional property, which says that the expected reward that you will get in the future is equal to zero if s is terminal. This means that when you end up in a terminal state, you will keep on staying there, and what happens from that time on is that you just get zero, zero, zero, zero — so you don't have to care about what happens from that time on until infinity. Graphically speaking, what's the basic idea? Well, you have your usual graph, and your actions may do different things: send you here, send you there, over there, and then from here you can go back — I'm drawing lines more or less at random. But the important thing is that we should focus our attention on this state here. This state is terminal because there are only arrows that go in and no arrows that go out of it, which is basically the intuitive definition of a terminal state. Now, what I want to say is this: if the MDP ends up in a terminal state with probability one — and 'with probability one' is a notation we will use a lot — then our objective function G can be written as the expectation of the sum over all times of gamma to the power t times the rewards. This is the usual expression we've been using last time, only that before we cut the sum short at capital T. Before, we had capital T as a horizon; now we can put infinity there. This is the important change, and it is quite intuitive. My process goes around, and at some point it ends up in a terminal state — or in one of many terminal states; there may be several of these coffin states. Once it ends up there, it gets stuck and it receives zero from there to infinity. Therefore, I can extend my sum up to infinity without problems. And the 'with probability one' means that even if the process happens to wander around for an infinite time — which could make this sum diverge — that happens with probability zero, so I don't care about it. It's a mathematical way of saying that, with probability one, in a finite time on average, I will get stuck in a terminal state.
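In symbols — my reconstruction of what is on the board — the two defining properties of a terminal state s and the resulting objective are:

```latex
p(s' \mid s, a) = \begin{cases} 1 & \text{if } s' = s \\ 0 & \text{otherwise} \end{cases}
\quad \text{for all } a,
\qquad
\mathbb{E}\big[\, r \mid s \ \text{terminal} \,\big] = 0,
\qquad
G = \mathbb{E}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r_t \right].
```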
So why is this interesting? Because under these conditions, the horizon has disappeared. Without proving anything yet, it is intuitive that my Bellman equation no longer needs to distinguish policies at different times. Remember — let me go back for a second to the Bellman equation we were writing in the other file, because it's pretty lengthy to write, but here it is — you had a mapping from the value at time t plus one to the value at time t. But now there is no meaning whatsoever to be attached to these times, because there is no horizon: there is no end of time, and no origin of time, if you wish. So there is no reason whatsoever why these two value functions should be different. This leads — again, this is not a proof; that will come shortly after the break — to the intuition that we could write it as follows. Sorry, here I wrote the discount, but that's fine as well; let me remove the discount here for clarity, since before we had no discount, and we will discuss the discount separately, so I'm not mixing too many things up. So this part is as before, and the Bellman equation for this problem becomes simply one for a single value function: V star of s is the maximum over all possible actions of the sum over s prime of p of s prime given s and a, times the reward r of s, a, s prime, plus V star of s prime. Correspondingly, the optimal action is the arg max over a of the same square-bracketed stuff, which I'm not rewriting. So this simple consideration — that if there is a terminal state, then I can dispense with time-dependent policies, and my optimal solution will be time-independent no matter what time it is — greatly simplifies the problem. But as it stands, it still seems very limiting. First, it only covers situations where you end up in some terminal state with probability one, and some situations just don't look like that. Second, you would still like to keep the idea of having some time horizon — being able to make decisions depending on how far into the future you want to plan — which here doesn't seem to be the case. So the idea that combines these two things is to use discounting. I introduced discounting last time: discounting means focusing on objective functions which are a sum over time from zero to infinity with a geometrically decreasing coefficient, this gamma, which is comprised between zero and one. And of course one situation you will still be able to include is when there are terminal states, so you can combine the two; but for the moment, let's focus on just this range. Gamma expresses, more or less, how far into the future you want your sum to contribute. If gamma is strictly zero, then G just becomes the expected reward right after the start; whereas when gamma tends to one, you have a sum which becomes extremely long. Very good. So the key to connecting the previous reasoning with discounting is to realize that we can see discounting essentially as a termination.
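Before we go on — for reference, the stationary Bellman equation and optimal action just described, in my transcription of the board notation:

```latex
V^{*}(s) = \max_{a} \sum_{s'} p(s' \mid s, a)\,\big[ r(s, a, s') + V^{*}(s') \big],
\qquad
\pi^{*}(s) = \operatorname*{arg\,max}_{a} \sum_{s'} p(s' \mid s, a)\,\big[ r(s, a, s') + V^{*}(s') \big].
```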
What do I mean? Let's consider the following case together. We have our MDP, which we can describe as usual with its actions — whatever diagram you have, it doesn't really matter what it is. Now we augment our original MDP with a new state; let's call it the void state. This state was not in the original MDP: the original MDP was defined here, and then there is the augmented MDP with the new state. And this new state is terminal, which means that once you are there, you stay there with probability one. What is its relationship with all the other states? The relationship is that, whatever action you take, you end up in the void state with probability one minus gamma, and all the original transition probabilities are multiplied by gamma. So basically you are modifying your MDP in such a way that — it's useful if I write it in the other form, maybe; let me use this — before, you had your state and your action, and you were ending up in a new state: the policy was sending you to the action, and then the transition probability p was sending you to the next state. Now the modification is the following: at every step, you rescale these probabilities by gamma, and you introduce the possibility of transitioning to the new terminal state with probability one minus gamma. So at every time step, you can think of this gamma as a survival probability. If you were to simulate this Markov process, at each step you would draw a random variable and decide upon it whether you survive and go to the next step, following your original Markov decision process, or you die: you end up in the coffin, and then you're done. The statement, then, is that this process is absolutely equivalent to the original one. Why is that? Because this gamma to the power t is exactly the probability that you survive t steps: at every step you have a probability gamma of surviving, so the probability of surviving t steps is gamma to the power t. So the statement — which I'm not proving, but it's quite intuitive — is that the discounted objective of the original MDP is equal to the undiscounted objective of the augmented MDP. Let me write it formally here: we have the original MDP with transition probabilities p and a discounted objective — this is a shorthand notation — and we have the augmented MDP, the green one, which has probabilities gamma times p and undiscounted rewards. Now, the final step is that this new process, if gamma is smaller than one, ends up in the terminal state with probability one — or, you're frozen again. Hello, hello, hello. Can you hear me? Yes. Okay, there was a small freeze again, I think. Lucky enough, I am at work, because if I were at home, it would have been my fault. We have to reshare the screen and check that the recording is still on. It says it's on, at least on my screen. Yeah, apparently the recording is on; we'll see what happens. Apologies for the inconvenience — and that's despite me being in my office, which should have the best connection ever, but that doesn't seem to be the case. Okay, so where were we? The point I wanted to make is that the introduction of this terminal state makes this problem equal to the terminal-state problem we discussed before, because if gamma is strictly smaller than one, the system ends up in the coffin state with probability one. So this allows us to say that in this case as well, we can expect the policy not to be time-dependent.
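As a sanity check of this equivalence, here is a minimal Monte Carlo sketch, under assumptions of my own: a small random MDP given by hypothetical arrays P and R, a fixed policy, and the convention that the reward at step t is collected with probability gamma to the power t (so the survival lottery is drawn before each step after the first; shifting it changes the estimate only by a constant factor gamma). The two averages should roughly agree.

```python
import numpy as np

rng = np.random.default_rng(0)

def discounted_return(P, R, pi, s0, gamma, T_max=200):
    """One trajectory of the original MDP, accumulating gamma^t * r_t."""
    s, G = s0, 0.0
    for t in range(T_max):
        a = pi[s]
        s2 = rng.choice(P.shape[-1], p=P[s, a])
        G += gamma**t * R[s, a, s2]
        s = s2
    return G

def terminated_return(P, R, pi, s0, gamma, T_max=200):
    """One trajectory of the augmented MDP: undiscounted rewards, but the
    process dies (enters the coffin state) with probability 1 - gamma."""
    s, G = s0, 0.0
    for t in range(T_max):
        if t > 0 and rng.random() > gamma:   # failed the survival lottery
            break
        a = pi[s]
        s2 = rng.choice(P.shape[-1], p=P[s, a])
        G += R[s, a, s2]
        s = s2
    return G

# Tiny hypothetical MDP: 3 states, 2 actions, random transitions and rewards.
n_s, n_a = 3, 2
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a] sums to 1 over s'
R = rng.normal(size=(n_s, n_a, n_s))
pi = np.array([0, 1, 0])                           # an arbitrary fixed policy
gamma = 0.9

est_disc = np.mean([discounted_return(P, R, pi, 0, gamma) for _ in range(5000)])
est_term = np.mean([terminated_return(P, R, pi, 0, gamma) for _ in range(5000)])
print(est_disc, est_term)   # the two Monte Carlo averages should be close
```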
So this is the heuristic argument — again, I'm not giving a detailed proof; I could, but we won't spend time on it because we will prove it in other ways. The basic idea is that, according to this line of thought, my Bellman equation in this case reads as follows. What has changed is just that I am introducing this discount factor gamma in front of the value of the next state. So with this equation, I can cover all the situations with gamma strictly between zero and one, because these correspond to the augmented MDP with the terminal state; or I can also use gamma equal to one if there is a true terminal state. By true, I mean that in the original MDP, there was a terminal state already, and this terminal state was reached with probability one. This kind of description has the appeal of being more compact, because now we have to deal with a single value function. But it still allows for the notion of a horizon, in the form of this one over one minus gamma, which is the average survival time: if you have a survival probability gamma per step, as in our analogy, then one over one minus gamma is the average survival time. So, recapping this part: we have developed the intuition that if we introduce discounting, we can expect our problem to be described by a stationary value function and a stationary optimal solution — something that doesn't depend on time. After the break, we will first derive this equation formally, proving that it is exactly the solution of the discounted MDP. And second, we will discuss how to solve it, because while we have gained a lot by removing time from the game, we also understand that we cannot use dynamic programming: there is no end to start from. We cannot go backward in time if the problem has no time, so we have to find some other way of solving this equation. This, again, we will do after the break. Any questions so far? Yes, please. Gamma — what did you write? Gamma equals one if there is a...? Sorry, that was really very confusing: if there is a true terminal state. If in the original MDP — not the augmented one, the one in white here — there was really a terminal state like this one, then you can set gamma equal to one, because the sum will converge anyway: your objective converges with gamma equal to one if the rewards eventually become zero for all times. Is that clear? Yes, yes, okay, thank you. Sure. Any other questions? Then let's take a break until ten past ten, and we resume as discussed. Okay, see you after the break.
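For reference, the discounted Bellman equation just described, in my transcription, together with the horizon interpretation of gamma:

```latex
V^{*}(s) = \max_{a} \sum_{s'} p(s' \mid s, a)\,\big[ r(s, a, s') + \gamma\, V^{*}(s') \big],
\qquad
\sum_{t=0}^{\infty} \gamma^{t} = \frac{1}{1-\gamma} \;=\; \text{average survival time}.
```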