And here we are. With the previous lecture we completed our excursion into the Bayesian setting for reinforcement learning, and from today we move towards the more inductive, data-based side of reinforcement learning: the methods that rely on data to perform prediction and control. To recap where we stand, let me draw our map once more. It has two axes: our level of knowledge of the model underlying the system, that is, the dynamics and the structure of rewards, and our observability of the state of the system itself. We were sitting in the upper right corner when we discussed Markov decision processes and the Bellman equation. Then, in the last series of lectures, we discussed how to move down along the observability axis to build a broader theoretical construction, the notion of partially observable Markov decision processes, which contains Markov decision processes as one specific instance, namely the instance of perfect observability. In general, when observability is partial, we need to combine our planning method with inference, so that we can plan while taking into account that we have to accumulate observations in order to predict the future. This naturally brings the notion of Bayesian updating into the game, and we discussed all of this at length. There are other exercises one can do with partially observable Markov decision processes, both small and large, and applications one can think of, algorithmic and more conceptual. If you are interested in developing this part of the course further, contact me and I can point you to further references; it could also become a small project for the exam if that Bayesian setting appeals to you.

Now we take a very different path: we move horizontally on the map. Remember that eventually we would like to arrive where the full reinforcement learning problem sits, which combines the difficulties coming from lack of knowledge of the model, lack of observability, and everything else that is real life. When we move horizontally, the radical change is that we give up the assumption that we know the model and can therefore plan. So far, in either of the two flavours, classical MDPs or partially observable MDPs, everything was a problem of planning. You should think of the decision maker as adopting the mindset of a chess player: ahead of anything happening, they plan one, ten, however many moves ahead, trying to deductively include all possibilities and to plan a sequence of decisions based on observations yet to come. It is essentially a problem of computation and planning. But when we move horizontally, towards the upper left part of this diagram, we explicitly assume that we do not have all this knowledge about the model. We have to replace this a priori information with a posteriori information: we are now really interacting with data. The key concept along that direction is that we really start talking about learning. Of course, this doesn't mean that we drop the notion of a model completely.
And it doesn't mean that there is a clear-cut frontier between all these methods and techniques. I am using this picture to simplify, sometimes oversimplify, the landscape so that you have clear landmarks in what is a rich landscape; of course there are many methods which are mixtures of several of these ingredients. Still, I think it is important for you to rationalize all these approaches using these simplifications. So what we are going to do for the next six lectures or so is to slowly build up the concepts and techniques that will lead us to the upper left corner of the problem, where the techniques live that go, broadly speaking, under the name of model-free learning and control. We have to be careful about what we mean by model-free. Technically, it means that we avoid the assumption that we know the transition probabilities and the rewards of our system. And what does it mean that we still keep ourselves at the top of this graph? It means that we still assume we know the structure of the state space and of the actions, so we have some structural information about the system: we know what the states are and we can observe them exactly, so we know where we are in space. If it's grid world, we have this information about the state, but we don't know what the rewards are and we don't know what the transition probabilities are. We will go through some examples to clarify what it means to be model-free and to try to learn and control, and in fact this is already a good point to discuss some examples. So what do we mean by being here?

First example: grid world. You've seen this example several times and we will use it again as a playground to see how to learn to control a system without knowledge of the model. What does it mean to have grid world in a model-free setting? As you remember, solving the Bellman equation by value iteration in grid world implies that you know the consequences of your actions: where you move, and with what probability you move to another tile of the system. This is, as usual, depicted as a collection of tiles, and you also know the average distribution of rewards as you encounter a triplet of state, action and new state along your path. Grid world without a model means trying to understand where to go to reach the rewards. You remember there are some rewards placed somewhere, like here and here, and there are some regions which are forbidden. There might also be penalties along the way, which we could call negative rewards: points that you want to avoid, points that you have to avoid, and points that you want to reach, starting from every place in your domain, your micro-world. Now, model-free learning and control for grid world means that you don't have this map. Your map has no features at all; forget about all of this, you just have an empty map. What you have in your hands at the beginning is just the grid. By this I mean you have some structural information, because you know what the grid is: I tell you that you are doing grid world on a grid which is, I don't know, 15 by 10, or on a grid which is 10,000 by 1,000.
I give you the states and I tell you the actions, which could be North, South, East, West, or whatever combination you like. But I am not telling you, and you as the decision maker do not know, where the rewards are. You don't know where the stumbling blocks are. All of this is unknown to you, and you also don't know what happens when you take an action: if you choose North, you don't know with which probability you will actually go North, and you don't even know whether you will make a single step North or two. All of this is hidden from you. So the only way to discover it, and to put it to use in order to recover the optimal strategy, is to interact with the environment, and this of course requires some trial and error. The basic idea is that we have to formalize this notion of trial and error in terms of attempts: trying things, collecting information as we do them, and optimizing our way of behaving as we learn. So there are two aspects: learning to predict what will happen in the future, first, and then control. These are the two key aspects we have to deal with. In an MDP you do both by using a model: when you solve the Bellman equation you are simultaneously predicting what will happen in the future, because you know the value of your optimal value function, and you are optimizing over it at the same time, so you are controlling. In policy iteration you do alternating steps: you improve your prediction by evaluating a new value function and you improve your policy in the next step. But that is, again, just computation. Here we have to learn to do it by collecting information as we act.

Second example: bandits again. What does it mean to be model-free with bandits? It's very simple. When I tell you that you are dealing with coins, Bernoulli bandits, I am already giving you a model, because I am telling you the outcomes will be zeros and ones: I am telling you how the rewards are generated and how the transition probabilities are constructed. So being model-free with bandits might mean that only some of the information is known. For instance, I could tell you that these are stationary slot machines, that is, they do not change their distribution no matter what you do. That is information about the model. But at the same time, the probability distribution of the outcome given the action is totally unknown, which means you don't even know whether it is Bernoulli: maybe if you pull an arm you get zero, one, or two, or a hundred. You don't know the distribution at all; it could be Bernoulli, Gaussian, whatever. You have to discover that. So the level of knowledge you have is significantly less than in the previous case. It is not totally model-free, in the sense that I am telling you these slot machines don't change over time; but you might also want to address the more difficult problem in which I don't even tell you that. Maybe overnight the casino owner changes the slot machines, so you were playing one slot machine and the day after you think you are playing the same one, but it has changed.
That would be an even more complex model-free problem, but already the version where the machines are stationary and only the distribution is unknown is challenging in itself. Any questions so far? It is important that we clearly delineate the path we are going to take. So for bandits, what is unknown is the class of probability distributions.

Question from the audience: can I ask something? Yes, please. Will we have knowledge of the history? Yes, we need some way to collect our information. As we play, we have to write down what has happened, in one way or another, so our policy will be a function of the history. But this is the real history of events that happen as I interact with the environment. It is not like the Bayesian case, where the history was a list of possible futures yet to be seen; here it is something that has already happened and you are just taking notes. So it is a frequentist approach in this case. Was that your question? Good.

So how do we move from here to there, from this situation to that one? Just to explore the different possibilities, let's start by recalling what we know: if we have an MDP, we can solve the problem by solving the Bellman equation. So one possibility is to sit here in the middle and collect information about your system: you interact with the environment, you choose some policy, you start doing things, and as you do them you keep a record of what has happened and you use this record to construct a model. The flow of this conceptual attempt would be: collect data, build a model, solve Bellman. The idea is that if I collect data about my environment, I can construct an empirical model and use it to solve, approximately, the Bellman equation that comes from that model. How does this work in practice? Take the transition probabilities: what is the probability of transitioning from a state s to a state s' given an action a? In practice, this is an expected value. Suppose you had to compute this quantity by making a Monte Carlo simulation of your Markov chain. What would you do? You would say: every time a transition brings me to a new state S at time t+1 equal to s', I count a one, and I do this every time the previous state was s and the action was a. This is just another way of writing the transition probability: it is a conditional expectation, which is in fact the probability, namely the ratio of the joint probability over the probability of the event on which we are conditioning. So it equals the expected value of an indicator that the three things happen, divided by the expected value of an indicator that the conditioning pair happens. At the numerator we have a function which is one if and only if the triplet of state, action and new state is exactly (s, a, s'). So every time my trajectory walks over this triplet (s, a, s'), I count one.
And I divide this by the total count of the times I started from s and took a. Do you all agree with what I'm writing here? It is just a way to rephrase transition probabilities in terms of events occurring along one specific trajectory of the system. If you are happy with that, then suppose we now have N trajectories. By a trajectory I mean a sequence of state, action, new state, new action, new state, and so on; this is one trajectory and I have N of them, so let's label them with i going from 1 to N. What am I doing here? I take my system with some policy, it doesn't matter which as long as it produces actions. I start from a state, pick an action from my policy, observe a new state, pick another action, observe a new state, and so on: this is my i = 1 trajectory. Then I repeat, perhaps starting from a new state depending on my initial distribution of states, with the same policy, and I construct a possibly large number N of trajectories. From this I can construct an empirical model. The empirical transition probability just replaces the expectations with empirical means:

P_hat(s' | s, a) = (number of times the triplet (s, a, s') occurs across all trajectories) / (number of times the pair (s, a) occurs),

which is simply a ratio of counts. And, as you can easily imagine, I can do the same for the rewards. The empirical reward for the triplet (s, a, s') is the average of the stochastic rewards I experienced on that triplet:

r_hat(s, a, s') = (sum of the observed rewards R_{t+1} whenever the triplet (s, a, s') occurred) / (number of times the triplet (s, a, s') occurred).

These are the empirical rewards. This accomplishes the first part, collecting data, with some given policy: there is no optimization here, no attempt to optimize over the policy. We just choose one policy, for instance the random one, pick actions at random at every time, act without trying to optimize, and collect data. With that we build a model for rewards and transitions, and with this model we can solve the Bellman equation. Is the path clear? It makes sense; it's what you do lots of times in physics, right? You collect data with an experiment, you build a model according to that experiment, and then you ask: what can be improved about the system, how do I optimize the functioning of my engine? I collect data, construct a model of what is happening, and use this model to infer, one way or another, how to improve the performance of my engine. It makes perfect sense. What could possibly go wrong? And actually the answer is: a lot.
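Before we look at what goes wrong, here is a minimal sketch in code of the model-building step just described. This is my own illustration, not part of the lecture material: the trajectory format and the function name `build_empirical_model` are assumptions made for the example.

```python
# Minimal sketch of "collect data, build a model": count transitions and average rewards.
# Assumes trajectories are lists of (s, a, r, s_next) tuples with hashable state/action labels.
from collections import defaultdict

def build_empirical_model(trajectories):
    """Return empirical transitions P_hat[(s, a)][s'] and empirical rewards R_hat[(s, a, s')]."""
    sa_counts = defaultdict(int)        # N(s, a)
    sas_counts = defaultdict(int)       # N(s, a, s')
    reward_sums = defaultdict(float)    # sum of rewards observed on (s, a, s')

    for traj in trajectories:
        for (s, a, r, s_next) in traj:
            sa_counts[(s, a)] += 1
            sas_counts[(s, a, s_next)] += 1
            reward_sums[(s, a, s_next)] += r

    P_hat = defaultdict(dict)
    R_hat = {}
    for (s, a, s_next), n in sas_counts.items():
        P_hat[(s, a)][s_next] = n / sa_counts[(s, a)]            # empirical frequency
        R_hat[(s, a, s_next)] = reward_sums[(s, a, s_next)] / n  # empirical mean reward
    return P_hat, R_hat
```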
So there are some critical issues with this approach. That doesn't mean the issues cannot be solved; it's just that, the way I laid it out, "collect data and then use the Bellman equation" is not the proper strategy, and it has to be modified somehow. What is the problem? Let's go back to bandits and ask a simplified version of the question. Suppose you know these are honest bandits, stationary in the sense that their distributions don't change, so you already know something about the model, but you don't know the rewards. That's our assumption. And say you have a two-armed bandit. What would this plan amount to doing? As I said, we don't need to build a model of the transition probabilities because we know them, but we have to build a model of the rewards. What are the empirical rewards here? The states don't change, so you don't care about them; the only thing that matters is which actions you take. In practice, your model is just: collect the rewards you got when you took an action and divide by the number of times you took that action. This amounts to saying: I repeat my experiment with whatever choice of actions, random for instance, I collect the rewards into a vector indexed by the current action, the arm I'm pulling, the coin I'm tossing, and I divide by the number of times I pulled that arm. But this is nothing but the empirical average for that arm out of N experiments. At this point I solve the Bellman equation with the empirical averages I have so far, and the Bellman equation tells me that the best action to take is just the arg max: if I have two coins, the arg max of the empirical average of the first coin and the empirical average of the second coin.

To recapitulate: if I apply this strategy of building a model and solving the Bellman equation for the two-armed bandit, whatever the distribution of rewards, what I do is decide, say, to go for N extractions, observe the empirical averages of the two options, compare them, and then commit to the option whose empirical average is largest. But this is nothing but something we already discussed: it is basically the explore-then-commit algorithm from one of the first lectures. And this is clearly not a good idea. Why? Because, purely out of bad luck, it may happen that what was really the best option appears to be the worst over 10 or 100 trials. Suppose you have coins: there is a finite probability that after 100 tosses, if coin number one has a bias of 0.6 and coin number two has a bias of 0.4, coin number two seems to be the better one. Of course, if the two coins are very different, this probability goes down very fast.
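A small simulation of this pitfall, my own illustration rather than anything from the lecture (the biases 0.6 and 0.4 come from the example above; everything else, including the function name, is an assumption):

```python
# Explore-then-commit on two Bernoulli arms: pull each arm n_pulls times, then commit to the
# arm with the larger empirical average. Estimate how often the truly worse arm gets chosen.
import random

def etc_picks_wrong(n_pulls, p1=0.6, p2=0.4, n_repeats=2000, seed=0):
    rng = random.Random(seed)
    wrong = 0
    for _ in range(n_repeats):
        mean1 = sum(rng.random() < p1 for _ in range(n_pulls)) / n_pulls
        mean2 = sum(rng.random() < p2 for _ in range(n_pulls)) / n_pulls
        if mean2 >= mean1:       # arg max picks the worse arm (ties counted as wrong)
            wrong += 1
    return wrong / n_repeats

for n in (10, 100, 1000):
    print(n, etc_picks_wrong(n))   # the misidentification frequency shrinks as n grows
```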
Actually, the probability that you misclassify the two coins goes down exponentially with the number of trials, and the rate at which it goes down is given by the Kullback-Leibler divergence between the two distributions. So if the two distributions are very different, the probability of mistaking one coin for the other quickly becomes small. But if the two parameters, the two biases, are very close, you need a lot of attempts to tell which is which, and you don't know that in advance because you don't know the model. So you clearly see there is a pitfall in this mechanism, which tells you that you have to do something better, and what you have to do better is to intervene at this level: when you go from the data to the model, you need more exploration. In practice, you have to modify your model by taking into account your uncertainty about what you have seen so far; you have to introduce some additional bonus that accounts for how confident you are in your model. At this stage I am not going very deep into this, because we are going to do something different from the start. This is just to tell you that a very naive approach to optimization based on empirical knowledge is very risky, and in fact doomed to fail even in very simple situations. One has to be extremely careful when mixing optimization with empirical objects. From a formal viewpoint, the tricky part is that you want to apply the maximum operator, which is strongly nonlinear, to an average; but this procedure switches the two, because you take the max before the average has been completed, since you only have a finite sample. These kinds of problems are very common in machine learning: when you deal with optimization in the presence of a finite sample you have to be extremely careful, and this is just one instance in which the problem becomes particularly conspicuous.

So this was to tell you that you can go this way: you can sit here and build a model, but it requires particular care, because you have to build the model and then modify it to account for statistical confidence: how many times did I actually observe that thing, what is my range of confidence in that variable? You need statistics before you move on to solving the modified Bellman equation, and then there are algorithms that approach the solution of the problem this way. But the whole part I just described has two purposes: on one side, to make you aware that there is a path to solving the Bellman equation with empirical data; on the other, to tell you that we are not going to do that, because there is another way which basically bypasses this problem. It takes a path directly from collecting the data to solving the Bellman equation, without building a model. So all of this was a motivating example for approaching the problem from a very different angle. The plan for the next six lectures is how to do this: how to introduce methods that, just by trial and error, by interacting with the environment, find provably convergent solutions of the Bellman equation.
In short: how to solve the Bellman equation without knowing what the model is. Remember that the Bellman equation is a nonlinear equation whose parameters are the model, so we are asking what seems to be a very difficult question: solve an equation without knowing its coefficients. What I will try to show you, today and in the following lectures, is that this is indeed possible, and that it is also possible to do it efficiently, which is the cornerstone of temporal difference methods. Before taking a break, this is the first point I want to highlight, and I'll write it here for future reference: solve the Bellman equation without knowing the model. As I said earlier, solving the Bellman equation means tackling two problems at the same time, prediction and control. In the end we will describe algorithms that solve the two problems together, so you will be able to write code that learns how to predict and control at the same time. But to make the ideas clearer, for now we split the two problems: today, tomorrow and in the tutorial we focus on prediction first. For this first part, assume we are given, or we choose, a policy pi, and we are not yet interested in optimizing over it. The question is: can we learn the value function of this policy without a model? It is clearly a scaled-down version of the full problem, because we are not asking for the best policy or the best value function; we are not looking for optimality yet, just for what it means to learn the value function of something without a model. That's the plan ahead of us for the next two or three lectures. Any questions about the overall plan?

Question from the audience: just a general question, not about the slide; is that approach not optimal even for a large number of trials? Okay, you are asking about the fact that I told you it doesn't give an optimal solution. What happens is that there clearly is a dependence in the limit. The problem is that the number of episodes, the capital N you would have to produce in order to be sure your optimization is correct, depends on the gap between the actual averages. If your two coins are 0.49 and 0.51, you will need hundreds of trials; but if they are separated by only one part in a thousand, you will need on the order of a million, roughly speaking, because the precision goes like one over the square root of N. And the problem is that you don't know this gap in advance. So what you want to do, and this is the idea, is to update this knowledge continuously as you try: don't rely on a fixed budget of N trials; keep computing, and while you compute, do increasingly more of what appears to be best while still keeping some room for exploring what appears to be worse but maybe is not. You want to do this dynamically, online. All right, I think it's a good time to stop and take a break; we start again at five past ten, sharp.
See you later. Excellent. We will now start laying out the techniques needed to address this first question: how do we learn the value of a policy without a model at hand, just by trial and error? The first thing to realize is that, by separating the two problems, we have already got rid of many of the issues I was discussing before: if we don't care about optimization, at least for this first part, we can use Monte Carlo to predict the value function. So there is one easy answer. Answer one: use Monte Carlo. What does that mean? In our specific case, we just have to remember that the value of a policy is the expected value of the sum of the discounted rewards, conditioned on the initial state S_0 being s. That's the definition of the value function: what I expect on average if I pick actions according to the policy and new states according to the transition probability; the rewards R_{t+1} then depend on the joint distribution of rewards and new states given the previous state and action. The two dots in the expectation stand for exactly those two variables. For any policy, I can just run Monte Carlo: I pick a state s and I run my system. I don't need to know the transition probabilities; as long as I have samples, I can compute approximations, pretty much as before. I can construct an estimate of this, which is simply the empirical average of the sum of discounted rewards. I'm not writing the formula because I think you understand the idea. As long as I don't have to optimize, this is perfectly fine. So we could stop here and say: problem solved. We have a way to approximate the value function; if we increase the number of trials, the precision improves roughly like one over the square root of N, and in the end, after collecting enough trials, we have a good approximation, provably convergent in the number of trials to the actual value function.

So what is not okay with this? There are actually two issues, neither of which is a deal breaker: the method works, but there are two reasons why we might want something better. Since today you look a little sleepy, I'll try to involve you and ask for suggestions: can you identify the couple of reasons why not Monte Carlo? Let me first list the advantages. The advantage of Monte Carlo is that the estimator is unbiased, which means the expected value of V-hat equals the true value. That's what an unbiased estimator is: a quantity constructed from empirical observations whose average over the distribution of observations is exactly the expected value of the underlying random variable. So it's unbiased, very nice. And it's simple, easy; zero formulas. We could wrap up and go home and say we're happy with this.
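As a minimal sketch of what "use Monte Carlo" looks like in code (my own illustration; the interfaces `env_step(s, a) -> (next_state, reward)` and `policy(s) -> a` are assumed placeholders for whatever hidden environment routine the exercise provides):

```python
# Monte Carlo evaluation of a fixed policy at a single state, using only a step oracle
# whose internals (transition probabilities, rewards) are hidden from the learner.

def mc_value_estimate(s0, policy, env_step, gamma=0.99, n_episodes=1000, horizon=500):
    """Average of empirical discounted returns started from state s0."""
    total = 0.0
    for _ in range(n_episodes):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):        # truncate once gamma**t is negligible
            a = policy(s)
            s, r = env_step(s, a)
            ret += discount * r
            discount *= gamma
        total += ret
    return total / n_episodes           # unbiased (up to truncation), but high variance
```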
So it's simple and it's unbiased. But what's the "but"? Can you come up with reasons why this might not be enough? Think about how you would do it. Suppose I give you grid world with 1,000 states and gamma equal to 0.99, which means a horizon of about 100: after roughly 100 steps the problem quickly fades out, either because the process terminates or because the rewards are discounted away. Think about how you would do it in code. I give you the random policy, so at every point of your grid world you take actions at random, one quarter in each direction, and I ask you to evaluate it. You don't know where the rewards are, you don't know where the stumbling blocks are, you just have to experience. In practice, I act as the environment, or a program does: there is a hidden function, and if you interrogate it, it returns the new state and the new reward, but you don't see how they are generated, you only see the outcomes. That's the name of the game: you don't know the model. I know it; I wrote the function, which you cannot read, but you can read its outputs.

Question from the audience: so we calculate the empirical average, and in Monte Carlo we need to know the empirical average that was calculated in the previous step, right, or maybe I'm confusing something? Okay, we are perhaps running a bit ahead of schedule here. What exactly do you mean by the previous step? Is it still a Markov decision process? We are approaching that, and these questions are getting close to the point. Haya has implicitly touched on exactly the points I want to insist on, still in the form of questions, so let's expand on them in a moment. Let me first finish describing this thought experiment. We start from a state and produce our simulation: you interrogate the oracle, this function which gives you states and rewards, and you go on and on, and at some point you stop because the discounting has died out, and then you restart from the same state, if you want the value function at that state, and produce another sequence, and so on. "I don't know how to formulate the answer right now, but I see the problem with it." Okay. There is one problem which actually borders on the trivial, which is perhaps why you are not naming it: this is terribly expensive. It's long. If gamma is 0.99, you have to run about 100 steps, for every state, a large number of times in order to kill the uncertainty. So Monte Carlo is long, and it has high variance. It has high variance because from every starting point there is a large number of possible trajectories: some may miss the target, others may hit it several times and quickly, so the possible outcomes of the sum can be very different, and you have to collect many, many trajectories to make the quantity converge to its average. In general, Monte Carlo methods are fine because they are unbiased and simple, but you pay a price in computation, in efficiency, in the number of realizations, et cetera. One question from the audience: can I ask a question? Sure.
"If I understood well, since I never encountered Monte Carlo, not even in my bachelor's: does Monte Carlo mean following a specific path, and covering all the possible paths of my problem?" Sorry for the jargon; do stop me whenever I use something you are not familiar with. By Monte Carlo I simply mean simulating your system. It is exactly what you would do if I gave you a routine, a function, to which you, as the user, provide your current state and the action you choose, and which returns the next state and the reward: you choose an initial state, you choose an action, my function gives you the next state and the reward, you take these and call my function again, and so on. So you produce trajectories. That's what Monte Carlo means here: you produce trajectories with some generative model that you may or may not know; in this case you don't, it is hidden from you. Is that clear? Good. "And do we use some kind of sampling to overcome this problem?" You could use a trick like importance sampling to improve on this, which is one way of improving and is canonical in Monte Carlo methods: you use some other distribution to produce your data and then you compensate for the difference between the two distributions. It's a very useful and very general trick, but not quite what we are going to do here.

Another thing Haya pointed out with her question: is the problem still an MDP, is it still Markovian? Monte Carlo is totally agnostic with respect to that. If you simply ask the system for trajectories, you can run Monte Carlo without knowing whether the system is Markovian or not; the Monte Carlo method itself doesn't care. But we know that our system is Markovian, and we are not leveraging that. This makes Monte Carlo very powerful and very general, but also less efficient, both from a practical and from a conceptual point of view, because it does not exploit knowledge that we have. So the two points that push us to go beyond it are that Monte Carlo is expensive, and that it does not fully leverage the structure of the Markov problem. What we want to do now is to use something we have already obtained before: we want to recast the problem of estimating the value function, which is a statistical estimation problem, in terms of something that already includes Markovianity and that will turn out to be more efficient. What is the way to go? The way to go is to use the recursion relationship. Let me refresh your memory: very early on, when we introduced the very definition of the value of a policy, we derived the relationship which says that the value function at a given state, for a certain policy, is the sum over actions and new states, with actions picked according to the policy and new states according to the transition probabilities, of the current reward plus gamma times the value function at the new state.
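In symbols (my transcription of the relation just stated, with r(s, a, s') denoting the mean one-step reward, as in the earlier lectures):

$$
V^{\pi}(s) \;=\; \sum_{a}\pi(a\mid s)\sum_{s'}P(s'\mid s,a)\,\bigl[\,r(s,a,s') + \gamma\,V^{\pi}(s')\,\bigr]
\qquad \text{for every state } s.
$$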
Here we have used the Markov property, because what we are saying is that the expected value of my full return from a state is what I get at this step plus what I get from the next step onwards, following the same policy: it connects what happens now with what happens at the next step. And this is a linear equation, a linear equation for the vector V. Remember, V^pi is a vector. From now on, since we are always evaluating one fixed policy, say the random policy or whichever policy you want, I will sometimes drop the pi; you must always remember the value function depends on the policy, but since it is always the same one, I sometimes omit it to keep the notation less cumbersome. Another way of writing this is in vector-matrix notation, with objects we already introduced in previous lectures; I am just refreshing them. You can define a vector R^pi whose component at state s is the sum, over actions according to the policy and over new states, of the rewards: R^pi(s) is the expected one-step reward from state s, expected because you pick the action a according to pi, the next state s' according to P, and then you get the average reward r. And you remember we also introduced the state-to-state transition probability under the policy, P^pi(s'|s), which is exactly what the name says: the probability of jumping from s to s' under pi. If we see R^pi and V^pi as vectors and P^pi as a square matrix, the recursion relation can be rewritten as V^pi = R^pi + gamma P^pi V^pi, that is, (I - gamma P^pi) V^pi = R^pi. This is nothing but the same equation in a slightly different form: we rearrange terms, collecting the gamma term on the same side as V. Why do I do that? Because it highlights the linear-algebra nature of the problem: the matrix I - gamma P^pi is invertible whenever gamma is strictly less than one. In an MDP, R^pi is a known vector, P^pi is a known matrix, and V^pi is the unknown: evaluating the value function of a given policy in an MDP is a linear algebra problem. Given a policy, you construct this rewards vector and this transition matrix, and by linear algebra, with whatever method you like, you solve the problem.

Now, the question we are asking is: can we solve this linear equation without knowing the matrix and the vector that appear in it? Stated like that it seems to make no sense, so let me make it a more careful statement: can I solve this linear equation without knowing the matrix and the vector, but having samples of them? If I replace the matrix and the vector by samples, can I solve for the value V? And how do I do it? How do I guarantee that my algorithm solves the problem exactly, given a sufficient number of samples?
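For contrast, here is what the known-model baseline looks like: with R^pi and P^pi in hand, policy evaluation is a single linear solve. A sketch with made-up numbers (the two-state example is invented purely for illustration):

```python
# Policy evaluation as linear algebra: solve (I - gamma * P_pi) V = R_pi for a known model.
import numpy as np

gamma = 0.9
P_pi = np.array([[0.8, 0.2],
                 [0.1, 0.9]])    # state-to-state transition matrix under the policy; rows sum to 1
R_pi = np.array([1.0, 0.0])      # expected one-step reward from each state

V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(V_pi)
```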
You see, we are trying to solve an equation, but now in a stochastic sense: we are trying to solve an equation stochastically. Everything we will be doing from now until Friday is exactly this, using the recursion equation to solve this problem, and this overall technique goes under the name of temporal difference learning. We proceed in the following way. For now I will just sketch the idea; tomorrow we will discuss the mathematics behind it in a simplified version, a toy problem of stochastic approximation, we will see how to prove that the method is actually sound, and then we will go back to the full problem of learning the value function. These techniques, which rely on the concept of stochastic approximation, are much broader in scope than learning a value function: we use them for this particular purpose, but, as we will see, they are definitely one of the tools that must be in your conceptual toolbox. You must know what it is, why it works, and why it is important. Since that requires a fair amount of math, today I will devote only the last fifteen minutes to setting the stage, and tomorrow we go through the demonstration.

So what is the general idea? Let's go back to the recursion relation and massage it a little: we are going to rewrite it in a different form that involves the idea of sampling and that removes, in a sense, the parts we do not know. The actions are under our control, so they are not really a problem; what we do not know are the P's and the r's, and we are going to rewrite the equation so as to replace them with sampled quantities. This will be the cornerstone of our algorithm. How does it work? First step: put everything on the same side. I am basically doing nothing; I just rewrite the recursion so that the whole expression equals zero. Let's check what has happened: very little, but something. The value function at s, which was on the left-hand side, has been moved to the right-hand side with a minus sign, and I have also pulled it inside the summation. Why is that possible? Because if I forget everything in the square bracket and consider only the two probabilities pi(a|s) and P(s'|s,a), summing them over s' and a gives one: the sum over actions of pi gives one and the sum over s' of P gives one, since these are probabilities. That is why I can pull the value function inside the bracket. Nothing fancy so far. But now comes a very simple step which is nonetheless fundamental. Remember, this relation is valid for all s; these are as many equations as there are states. I am not summing over s, I am considering the relation for every s.
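Written out, the rearranged relation is

$$
0 \;=\; \sum_{a}\pi(a\mid s)\sum_{s'}P(s'\mid s,a)\,\bigl[\,r(s,a,s') + \gamma\,V^{\pi}(s') - V^{\pi}(s)\,\bigr]
\qquad \text{for every } s,
$$

which is legitimate precisely because $\sum_{a}\pi(a\mid s)\sum_{s'}P(s'\mid s,a)=1$.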
So this recursion also says that a certain expectation is zero: if I take the action according to my policy, the next state according to my transition probability, and the corresponding reward, then the expectation of the reward plus gamma times the value at the new state minus the value at the old state, conditioned on the current state being s, is zero. Writing the time indices explicitly makes it even more transparent: the state is S_t, the next state S_{t+1}, the reward R_{t+1}, and the statement is that the expectation of R_{t+1} + gamma V(S_{t+1}) - V(S_t), given S_t = s, equals zero. Do you see that this line is the same as the previous one? Only now I have expressed it implicitly as an expectation over the policy and the model. The object that appears under the expectation is very important and has its own name: it is called delta_{t+1}. It is a stochastic object, because it depends on the current sequence of states, actions and rewards you observe, and it is called the temporal difference error; "temporal" and "difference" are exactly what the name stands for. Why is it a temporal difference error? Let's look at it. We can rewrite delta_{t+1} as the current reward minus the bracket [V(S_t) - gamma V(S_{t+1})]; this is just a rewriting in which I collect the minus sign in front. Written like this, you see that the first term is the actual reward at time t+1, along the sequence S_t, A_t, S_{t+1}, while the bracket is the reward estimated by the value function: the expected return from the state S_t minus gamma times the expected return from the next state is exactly the expected one-step reward, as follows from the definition of the value function; you just have to look back at the recursion equation. Let's say "estimated" rather than "expected": it is the reward as estimated according to the value function. That is why it is called an error: it measures the difference between what you actually observe and what you estimate on the basis of the value function. On average these two things are the same; that is why the recursion equation can be restated as "the expectation of the error is zero". This is just manipulation of the equation, but it is instructive, because it tells you that this term can be interpreted as a difference which changes stochastically from realization to realization, but which must be zero on average if the function V is the value function, by definition.

So what have we done? We have transformed our set of linear equations, the recursion relationship, into a set of stochastic equations. We no longer express the value in terms of the coefficients of the equation; instead, we know that on average the value must satisfy a certain condition. That is how to read this way of writing the recursion: it is a condition on the value function. To be the proper value function of that policy, a function must satisfy this statistical condition. Why is this interesting and important? Because it gives us an idea about how to solve the problem.
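To summarize the condition just derived (the equivalence with V = V^pi holds because, as noted above, I - gamma P^pi is invertible for gamma < 1):

$$
\delta_{t+1} \;:=\; R_{t+1} + \gamma\,V(S_{t+1}) - V(S_t),
\qquad
\mathbb{E}_{\pi}\!\bigl[\delta_{t+1}\,\big|\,S_t = s\bigr] = 0 \;\text{ for all } s
\;\Longleftrightarrow\;
V = V^{\pi}.
$$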
Here is the idea. Suppose we start with a guess for our value function: we don't know what the value function is, so we start with a guess and we compute the temporal difference error for this guess. On average it will not be zero, because if it were zero on average, this guess would be the exact value function and the problem would be solved. So I try some value function and I know that on average there will be some discrepancy. Maybe my average temporal difference error is larger than zero; that tells me my first guess was pessimistic, because it predicted a reward smaller than the reward my policy actually gives. So maybe I can use this to correct my estimate. The qualitative idea is that, even though the equation is now purely stochastic, it still provides information on how to correct the estimate. That is the first insight. The deeper insight comes from a simplified version of the problem, which I will introduce now; tomorrow we go into further detail.

Let's look at an oversimplified, one-dimensional version. It doesn't correspond to any real Markov process, but let's abstract away and think of a simple problem. The recursion in matrix form was (I - gamma P) V = R; let me sketch it in one dimension, where the horizontal axis represents the possible values v. The relation becomes a line, r + gamma p v - v = 0: a linear function of v that crosses zero at some point. That is the geometric interpretation of a linear equation: finding where this line crosses zero. So let me redraw the object: this is my expected delta as a function of the guessed value. Here is the true value of the policy: the axis is the space of all possible values, and V^pi is the one value which truly is the value of my policy, the one I want to find. If I am in an MDP, I just have to find this zero, and I know what the line is, because I know its coefficients, its slope and its intercept. That is what an MDP looks like: finding the zero of an equation, in this case even a linear one. What we want to do now is to replace knowledge of this line with stochastic approximation. In stochastic approximation I do not know what the blue line is, but I can interrogate my system and get samples: I want to find the green point, the zero, without knowing the blue line, asking my oracle to produce samples that lie on the blue line only on average. So I have to come up with an algorithm that solves this problem, where "solves" needs to be specified: in which sense, on average, in probability? The strongest thing would be an algorithm that, with probability one, finds the true zero of the equation without knowing the blue line, knowing only that on average my red sample points lie around it, and we also want to know how fast it gets there.
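As a preview of tomorrow, here is what such an iteration can look like in the one-dimensional toy problem. This is my own sketch, running ahead of the lecture: the noisy oracle, the step-size schedule and all numbers are invented for illustration; the oracle's mean is the line r + (gamma - 1) v, whose zero is the quantity we want.

```python
# One-dimensional stochastic approximation: find the zero of an unknown decreasing line
# f(v) = r_true + (gamma - 1) * v, observing only noisy samples of f at the current guess.
import random

gamma, r_true = 0.9, 1.0
v_star = r_true / (1.0 - gamma)            # the true zero, here 10.0 (unknown to the algorithm)

def noisy_f(v, rng):
    """Oracle: a sample whose mean is f(v) = r_true + (gamma - 1) * v."""
    return r_true + (gamma - 1.0) * v + rng.gauss(0.0, 1.0)

rng = random.Random(0)
v = 0.0                                    # initial guess
for k in range(1, 100_001):
    alpha = k ** -0.6                      # decreasing steps: sum alpha = inf, sum alpha^2 < inf
    v += alpha * noisy_f(v, rng)           # move in the direction the noisy sample suggests
print(v, v_star)                           # v ends up close to v_star = 10.0
```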
Okay, so these are the kinds of questions we have in mind. Tomorrow we will see how stochastic approximation works in a simple one-dimensional problem like this, and we will see how it is connected to stochastic gradient descent and to other important concepts in machine learning; they are actually the same thing. Then, when we have completed that part, we go back to our full multi-dimensional problem, and we can formulate our strategy for temporal difference learning of the value function. That's the plan. Any questions about the outline? Be prepared: tomorrow morning the first hour will be mathematically heavy. We are going to take all the steps, but I think it is a good investment: you have probably been exposed to the notion of stochastic gradient descent, but I am not sure anyone has ever proved to you the conditions under which it converges, and that is what tomorrow's lecture will also provide. Fine. With that, any questions, doubts, complaints? Not for today? Okay, since that's not the case, have a nice day and see you tomorrow morning. Bye-bye. Thank you, bye.