And here we are. Okay, good morning everyone. So the plan for today's lecture is to discuss the mathematical setting of Markov decision processes that we introduced yesterday, and to walk step by step towards writing down the optimality equations in one of the settings I described yesterday, namely the finite horizon setting, okay? The mathematical procedures are slightly different in the two cases, so it's useful to keep them in two different boxes; today we will discuss just the finite horizon case.

So let's start by summarizing the fundamentals of Markov decision processes that I introduced yesterday. Let me remind you what the ingredients are. We are discussing a situation in which we have discrete states belonging to some state space. So these are the states. Then there are actions, which can be taken from any one of these states. Then there are rewards, which are real numbers. You can make additional assumptions, such as these rewards being bounded or having a probability distribution with finite moments, okay? These are technical requirements. We are not interested in oddities in the structure of the rewards, so we will always implicitly assume that they behave as nicely as required. In most cases you can think of them as being bounded between a certain minimum reward and a maximum reward. So these are the rewards, okay?

The structure is inherently Markovian, in the sense that there exists a probability distribution for the new state and the observed reward given the previous state and the previous action. This is a transition probability. What does that mean? It means that all these p's are nonnegative, and if we sum over all s prime and integrate over dr, then p of s prime and r given s, a is equal to one for every state-action pair, okay? So it's always properly normalized. And like I said, you can also make additional assumptions about how well it behaves, okay? This is what is usually called the model of the environment.

And then we have a policy, okay? The policy is a family of probability distributions, and in the following we will actually consider a generalization in which the family can be time dependent. So at each time step of our process, and remember the times are discrete, time zero, time one, time two, et cetera, there is a probability distribution which maps states into actions. These pi's are probabilities themselves: each of them is nonnegative, and summing over all possible actions they are always normalized to one. This is, in general, the policy; in this case it is a time-dependent strategy of decision making.

There are two ways of representing a Markov decision process graphically. One is in terms of a time-directed graph, okay? Suppose we start from state s zero. Using policy pi zero we can draw one random action according to that probability distribution, which gives us action a zero. Then, according to the transition probability p above, these two together generate a new state, which I write as s one because it is at time one, and in the process they also produce a reward, which we can call r one, okay?
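To make these ingredients concrete, here is a minimal sketch (not part of the lecture) of how one could store them for a finite problem, assuming the rewards take only finitely many values so that p(s', r | s, a) fits in a dense array. All sizes and names below are illustrative placeholders.

```python
# A minimal sketch of the MDP ingredients: states, actions, bounded rewards,
# the model p(s', r | s, a), and a time-dependent policy pi_t(a | s).
import numpy as np

n_states, n_actions, n_rewards, horizon = 3, 2, 2, 5
rng = np.random.default_rng(0)

# Model of the environment: p[s, a, s_next, r_idx] = p(s_next, r | s, a).
p = rng.random((n_states, n_actions, n_states, n_rewards))
p /= p.sum(axis=(2, 3), keepdims=True)          # normalize over (s', r)
assert np.allclose(p.sum(axis=(2, 3)), 1.0)     # sum_{s'} int dr p(s', r | s, a) = 1

# Possible reward values (bounded, as assumed in the lecture).
reward_values = np.array([0.0, 1.0])

# Time-dependent policy: pi[t, s, a] = pi_t(a | s), normalized over actions.
pi = rng.random((horizon, n_states, n_actions))
pi /= pi.sum(axis=2, keepdims=True)
assert np.allclose(pi.sum(axis=2), 1.0)
```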
And these two arrows are given by the transition probability p. Is it clear to you what I'm drawing here? Okay, this is the way you would actually produce such a process. If you were given these two ingredients, the policy and the transition probability, you would be able to generate a sequence of states, actions, new states and rewards, okay? By this map. Then you repeat with pi one, and you get an action a one; these two together give a new state s two, and in the process you get a reward r two, and so on and so forth until some final time at which you reach a state s capital T and get the reward r capital T. This was obtained by taking the action a capital T minus one, and then everything ends, okay? So capital T, for today, will be the horizon, the time horizon, and it is the end of the world, okay? You start at time zero, and at time T everything finishes. Everything that could possibly happen later on gives no reward, no new state, and no new dynamics, so you can consider only what happens up to the time horizon capital T, okay? Which means that small t goes from one to capital T. Okay.

Like I said yesterday, the handle that you have on such a system is the policy, just this object here; this is the controllable part. On the other hand, this part here is not controllable: there is no way the agent can move its handles to modify the transition probability p. The only thing it can do is properly select the actions to take at every time step in order to obtain a result, okay?

So, like I said earlier, there are two different ways of describing this process. The one above was in time, but there is also another graphical description, in the form of a usual graph. You have several states. I should not call them s one, because that would be confusing, as it mixes up with the time index. So you have several states, okay? Which I will call s, s prime, and maybe another state s double prime, okay? These are just labels given to states; they could be one, two, three, whatever. Then from each state you can pick one among many actions. These are all actions that are accessible from state s, and each of these actions can send you either back to state s or to some other state, with different probabilities, okay? You would usually write this as the probability of going from s to s given a, the probability of going to s double prime, and the probability of going to s prime.

Professor, is there any chance of staying in the same state? Is there a loop that brings it back? Yes, this is one instance: there are actions that bring you back to the same state. This is absolutely possible. Okay, thank you.

Okay, and then of course you can generalize as you like, okay? This is a graphical description which is more reminiscent of how you would describe a Markov chain, with all the arrows that go from one state to another. Okay, what is left to define is the goal of the optimization, right? This picture is valid for any possible control, for any possible policy, but we are interested in formulating a problem in which we have an objective, and we want to find the best way to decide in order to accomplish that objective. In our case, the objective is the following. So suppose, yeah, let me state it.
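As an illustration of this generative picture, here is a hedged sketch of how one episode could be produced from the arrays in the previous snippet: draw a_t from pi_t given s_t, then draw the pair (s_{t+1}, r_{t+1}) from p given (s_t, a_t), up to the horizon T. The function name and signature are my own, not from the lecture.

```python
# Generate one episode s_0, a_0, r_1, s_1, ..., a_{T-1}, r_T, s_T
# from the model p and the time-dependent policy pi, starting at state s0.
def rollout(p, pi, reward_values, s0, horizon, rng):
    n_states, n_actions, _, n_rewards = p.shape
    s = s0
    states, actions, rewards = [s], [], []
    for t in range(horizon):
        a = rng.choice(n_actions, p=pi[t, s])      # a_t ~ pi_t(. | s_t)
        joint = p[s, a].ravel()                    # p(s', r | s_t, a_t), flattened
        idx = rng.choice(joint.size, p=joint)      # sample the pair (s_{t+1}, r_{t+1})
        s, r_idx = divmod(idx, n_rewards)
        states.append(s)
        actions.append(a)
        rewards.append(reward_values[r_idx])
    return states, actions, rewards                # rewards = [r_1, ..., r_T]
```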
So the objective is to find the argmax, okay? The argument that maximizes this function over all possible sequences of policies, so over all possible pi zero up to pi capital T minus one, okay? This is the sequence of policies that you can use at the different times: pi zero, pi one, and the last one that makes sense to consider is pi capital T minus one. You want to find the maximum of what? Of the expected value of the sum, for t going from zero to capital T minus one, of R t plus one; we said the last reward is R capital T, okay? There are equivalent ways of writing it, of course, if you shift the indices, but that's okay.

Okay, so we have to be clear about what we mean here. This object inside the argmax depends on the policy, among other things, through the expectation value, and the rewards are in principle random, stochastic quantities. So our formalism accounts for the possibility of these objects being stochastic: the rewards could be stochastic or could be deterministic, the system is flexible in this respect. Of course the dynamics could also be deterministic, okay? What we wrote is very general, these are probability distributions, but this includes the special case where, if you take an action, you end up in a fixed given state. The deterministic case is a limiting case of the general case, which means that, in practice, optimal control theory is a subset of this description.

Okay, so let's clarify what this expectation value means, okay? We have a sequence generated by this process, which I can write down as a stream: state s zero, action a zero, then we get reward r one and state s one, then we take action a one, which gives reward r two and s two, then from state s two I pick action a two, which gives me reward r three and state s three, and so on and so forth until I end up with a capital T minus one, r capital T, s capital T. Okay, so this is a full history from beginning to end.

What does this expectation mean here? It means that we are taking the expectation with respect to the following sampling scheme; this is a notation which is very much used in stochastic processes and in mathematical statistics. We pick our action a t from the policy pi t. If I define it like this, the action a zero is taken from the policy pi zero, and I have to keep the same time index: a t is taken from the policy pi t. We put a dot here and then we condition on the state s t, okay? So this means that if I am in state s t at time t, I pick a t from the probability distribution given by pi t. This is the formal way to describe it. And then I pick s t plus one from my p. Actually, I pick both of them, the pair of reward and new state together, from p of dot, dot given the previous state and the previous action. The only thing left to describe is how I pick the initial s zero. This is the only orphan here; I have to prescribe some way to define it at the beginning. So finally, s zero is drawn according to some initial distribution, which we call, for instance, rho zero. This describes the way you generate the process: you are given an initial distribution rho zero, you pick an s zero, then you look at your policy at time zero, pick an a zero, and so on and so forth. I am deliberately very slow and redundant in order for you to understand all the implications here.
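Purely to illustrate what this expectation value means operationally, one could estimate the objective by averaging the total reward over many episodes generated as in the previous sketch, with s zero drawn from rho zero. This is only my Monte Carlo illustration of the definition, not the solution method developed in the lecture.

```python
# Monte Carlo illustration of G(pi) = E[ sum_{t=0}^{T-1} R_{t+1} ],
# reusing the arrays p, pi, reward_values and the rollout() sketch above.
import numpy as np

def estimate_objective(p, pi, reward_values, rho0, horizon, n_episodes=10_000, seed=1):
    rng = np.random.default_rng(seed)
    n_states = p.shape[0]
    total = 0.0
    for _ in range(n_episodes):
        s0 = rng.choice(n_states, p=rho0)                       # s_0 ~ rho_0
        _, _, rewards = rollout(p, pi, reward_values, s0, horizon, rng)
        total += sum(rewards)                                   # sum of r_1 ... r_T
    return total / n_episodes                                   # estimate of G(pi)
```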
So when we write this objective function, again, let me rewrite it. This is a big object; let's call it G for the moment. It is, again, the expectation of the sum, for t from zero to capital T minus one, of the rewards R t plus one. When we take these expectation values, one thing we can do at the beginning is, first of all, take the average with respect to the distribution of the rewards themselves, okay? Leaving all the other things as they are. So we take that average first and define another object, with a slight abuse of notation: the reward for a state, an action and a new state s prime is defined as the expectation value of R, where R is drawn from its distribution given s prime, s, a. Even more explicitly, it is the integral over r of r times p of r given s prime, s, a. Since we are interested in averaging, from this point on we don't really care what the actual distribution of the reward values is, as long as we are only interested in optimizing averages, okay?

Notice that the fact that we can ignore the probability distribution and focus only on averages is a distinctive consequence of working with a known model of the environment, because when we deal with learning, probability distributions will be very important, okay? The way rewards are distributed has a strong influence on learning: if your rewards are deterministic, you need just one shot to learn what the mean value is, but if they are highly noisy, you will need to average over many samples. Since now we know everything, we don't need to sample, and since we don't need to sample, we don't need to carry around all the information about the probability distribution, as long as our focus is on average values. Like I told you yesterday, there might be different objectives, okay? Maybe you are interested in something related to risk, so you want to avoid very large negative rewards, severe punishments. Then you don't care about what happens on average, you just don't want to receive some large punishment. In that case you would have to modify these objects, and those objectives would depend on the probability distributions, but we don't discuss that at this stage, okay?

So from now on we can basically ignore the dependence on the distribution of rewards and rephrase our goal as a simpler expectation value. Again, we pick our actions according to the policy at time t and we pick our states according to p. Okay, here I am again slightly abusing the notation, because this object is just the marginal distribution of p of r, s prime given s, a, okay? That is, p of s prime given s, a is defined as the integral over r of p of r, s prime given s, a, okay? I should give it another name, another symbol, but let's not make the notation too heavy. What I mean is: forget about the outcomes of R, because what we care about here are the averages; and s zero still depends only on rho zero. Then we can replace that sum with the sum, for t going from zero to capital T minus one, of these average rewards, which depend on s t, a t and s t plus one. Okay, this is very heavy as notation, but in the following I will make it much, much leaner.
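In the finite-reward array representation from the earlier sketch, these two objects, the marginal transition probability and the expected reward, could be computed as follows. This is just my illustration of the definitions, with sums over the reward values replacing the integrals over r.

```python
# Assuming p[s, a, s', r] = p(s', r | s, a) and reward_values from the first sketch:
import numpy as np

p_marginal = p.sum(axis=3)                          # p(s' | s, a) = sum_r p(s', r | s, a)

# Expected reward r(s, a, s') = sum_r r * p(r | s', s, a)
#                             = sum_r r * p(s', r | s, a) / p(s' | s, a).
with np.errstate(invalid="ignore", divide="ignore"):
    r_expected = (p * reward_values).sum(axis=3) / p_marginal
r_expected = np.nan_to_num(r_expected)              # where p(s'|s,a) = 0, set the reward to 0

assert np.allclose(p_marginal.sum(axis=2), 1.0)     # still normalized over s'
```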
You always have to go back and think about what it means to take expected values and what it means in practice, okay? Very good. So, like I said, this is my G, which depends implicitly on pi, okay? That is, it depends on pi zero through pi capital T minus one. Okay, so we will now slowly set the stage to understand how to solve this problem, okay? The goal for today is to solve it. What do I mean by solving? Solving here means finding an algorithm that, for whatever choice of your model, that is, for whatever choice of the distribution p, identifies the sequence of decisions that have to be made. You fix a horizon, you fix the model, and then you are asked: what is the best control to achieve this goal, and how do I compute it? Okay, that's the goal for today.

Before we get to that, which will require a little bit of calculation, not much, but it's important to go step by step, it is perhaps good to get our feet back on the ground and think about one very simple, specific example of a Markov decision process, a simplified version of the cleaning robot system, which can also be used as a playground for solving the kind of problems we are interested in. So we take a small break from the formalism and introduce an example of an MDP, which is also discussed in the book by Sutton and Barto, where it is called the recycling robot.

Okay, so the recycling robot is a relatively simple object. Let's start from the description of the states and actions that are available to this robot. The set of states for the recycling robot is very simple: it is made of just two states, okay? And these two states are the levels of the battery. The only thing that matters in the relationship between the agent and the environment is whether the agent is at a level of charge which is high or low. Nothing else matters. So it's a very simple description: the only characterization of the environment is whether the robot is high on charge or low on charge.

And what can the robot do? The robot can do different things, which may in principle depend on whether it is in high charge or low charge. For instance, in this example, when it is on high charge it can do two things. First, it can go around and search for rubbish, okay? Or for empty bottles; since it's a recycling robot, that's a more proper description of the goal. It moves around and searches for empty plastic bottles to recycle. Notice that here we don't have any specific spatial structure, okay? This is a very simplified situation where there is no moving around, no taking one step in one direction, et cetera. It's a navigation problem, but in a zero-dimensional space, okay? The decisions are more about what to do than where to go. So it can search, or it can wait. What is this wait action? Well, the wait action means that the recycling robot just stays in place, maybe because it's low on battery, okay? And so it's not a good idea to move around. Nevertheless, if it stays in place, maybe someone walking around is willing to stand up from their desk, take a plastic bottle and throw it into the recycling robot, okay? So it's a proper action.
It's just not moving around, but waiting for someone to drop a bottle in or, as we will see, waiting for someone to take the robot and put it at the recharging station, yep?

Since this action is for high battery, wouldn't that eliminate the waiting because of low battery? Or will those two overlap in actions when the battery is low?

Okay, we will see how this problem is constructed. You can have very many variations depending on what is sensible or not; this is one out of many. I'm following the book, so you can look at the details there as well. But of course you can build up your own set of states and actions, as rich as you want; I'll just consider this particular example. There might be other, more sensible choices, I agree.

When the charge is low, it's very similar, in the sense that you can do the same things, but you can also go and recharge. Okay, there is a third available action. So this defines the states and the actions. Now we have to define the rewards, okay? Which in this particular example are not stochastic. Like I said earlier, they can be stochastic, but since we only care about the averages, we would just have to come up with a table in which, for every state, action and new state, you put a reward, et cetera. But we won't go through the lengthy operation of writing down a table of transition probabilities, because it would be a full page of entries; you can do it by yourself. Instead I will jump directly to the graphical description, because it's more compact and contains exactly the same information.

Okay, so the graphical description, like I said, is rather simple. There are two states, high and low battery. And like I said, if you're high on battery, you can take the action of waiting, okay? Reasonably, if you wait, you don't consume battery, so you stay in the same charge state, okay? The probability of going back to high, this object that I'm writing here, one, is the probability of being in the high state given that you were in the high state and took the action of waiting, okay? In short, I'm just writing a one. If you decide to wait, you will still be high; there is no leakage of battery if you wait. Again, this is just one choice, you could have done otherwise, but it fixes the ideas. And if you do that, you get some reward, R wait. You can think of it as follows: at every instant of time the random reward could be a zero or a one, zero if nobody drops a bottle in the recycling robot's basket, one if someone drops a bottle, or ten if someone drops ten bottles, okay? This R wait is the average number of bottles delivered to the robot while it just sits and waits somewhere, okay?

If the robot decides to search, then two things can happen, okay? Here there is some energy consumption, which we can model as two possible outcomes of the search: maybe the search is relatively short, and the robot gets back to the high level of charge, or maybe it's long and consuming, and it takes the robot to a low level of charge. Again, the level of charge is really a continuum, so here we are making wild approximations, but it's just for the purpose of setting up a toy model.
And then in this case you get a reward R search, and maybe the reward for searching is higher than the reward for waiting, because the robot moves around and can collect more bottles from people who are too lazy to stand up from their desks, okay? This alpha is the probability of still being high on charge after a search, and of course, if you are not high on charge, you are low on charge, with the complementary probability, and you get the same reward as before.

And then, okay, now we can go quicker. We have the same kind of actions for the low state: we can search or we can wait. Again, if we wait, we certainly don't gain any battery, so we stay low with probability one and get the reward R wait. If we search, we again have some probability of ending up low; since we were already low, the probability of still being low is maybe higher, so it's another number, beta, and the reward is R search. And how do we go from a low level of battery to a high level of battery? Well, maybe someone is helpful enough to see that the robot is low on charge, picks it up and carries it to the charging station. Clearly, if this happens, there has been some human intervention, so for this transition you don't get a positive reward, you pay a cost C, okay? The meaning is that if your robot switches off because it's out of battery, not low any longer but really out of battery, then someone will have to pick it up, and this has a cost. And then finally we have the last action we introduced, the recharge one. This means the robot has spontaneously decided to go and recharge before the battery dies, okay? If it decides to recharge, it will succeed, so the probability is one, but this clearly gets no reward, because on its way to the recharging station it doesn't care about bottles coming in, okay? It just doesn't look for bottles.

Okay, this was explicitly meant to be a very simple toy model, but this is how you could describe, at a very coarse-grained level, any kind of decision-making process, okay? You can clearly see this is meant to provide you with a blueprint: if you have an actual decision-making problem, your blueprint is to go through these steps. Define meaningful states, define meaningful actions from the states, define how, from a state, taking an action brings you to other states, define with which probabilities you get there and with what kind of distribution of rewards. Designing the structure of a problem is a key step, and this is one simple, particular example.

So in this case the formulation of the problem is the same. The goal, for instance, is: set the horizon to, say, 100 steps, and ask what the best policy is, that is, the probability distributions over actions that maximize the sum of the rewards obtained overall, which means the average number of bottles that you can collect minus the costs incurred when the robot had to be carried to the charging station by some external operator, okay? I hope this is clear to you. Very good. So at the end of today's lecture, basically, we will have a way to approach this problem operationally and to find out what the optimal policy actually is.
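To tie the example back to the formalism, here is a sketch of the recycling-robot model written as a transition table, in the spirit of the Sutton and Barto presentation: each (state, action) pair maps to its possible outcomes as (probability, next state, expected reward). The numerical values of alpha, beta, R search, R wait and the cost C are placeholders chosen for illustration, not values from the lecture.

```python
# Placeholder parameters (illustrative only).
alpha, beta = 0.8, 0.6           # prob. of staying "high"/"low" after a search
r_search, r_wait, C = 2.0, 1.0, 3.0

STATES = ["high", "low"]
ACTIONS = {"high": ["search", "wait"], "low": ["search", "wait", "recharge"]}

# (state, action) -> list of (probability, next_state, expected_reward)
model = {
    ("high", "search"):  [(alpha, "high", r_search), (1 - alpha, "low", r_search)],
    ("high", "wait"):    [(1.0, "high", r_wait)],
    ("low", "search"):   [(beta, "low", r_search), (1 - beta, "high", -C)],  # rescued at a cost
    ("low", "wait"):     [(1.0, "low", r_wait)],
    ("low", "recharge"): [(1.0, "high", 0.0)],
}

# Sanity check: each (state, action) pair defines a proper probability distribution.
for (s, a), outcomes in model.items():
    assert abs(sum(prob for prob, _, _ in outcomes) - 1.0) < 1e-12
```

This table is exactly the "one page of entries" mentioned above, just written compactly; a finite-horizon objective over, say, 100 steps would then be evaluated on top of it.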
I can tell you from the beginning that the algorithm we will come up with is very simple, but even in a simple case like the one I depicted, you cannot find the optimal solution analytically. Finding analytically optimal solutions in Markov decision problems is a rarity, okay? So when we stumble upon a situation where you can write down the optimal policy explicitly, that is something unusual. The typical case is that you have a very simple, clear-cut procedure to solve the problem, which will anyway require some numerical implementation as an algorithm, okay? So I think it's a good point to stop here, before we delve into the actual calculations in a second. Are there any questions so far? Okay, good. I hope that's because it's clear and not because it's totally obscure, but one never knows. Okay, then we can take a break and reconvene at 10 sharp. Okay, see you later.