Okay, excellent. So let's start. Quick recap: where are we? During the last class, we outlined the plan, which is to understand how to control a system when we do not have a model at hand. So we have some a priori knowledge about our system: we know what the state space is, we know what the action space is, but we don't know how actions turn states into new states. So we don't know the transition probabilities from states to new states, and we don't know how rewards are given to the agent in response to its actions. And nevertheless, we would like to be able to control the system. So this is the point where we really depart from the classical picture of optimal control or model-based control, okay? This is the part where the learning comes in heavily. So how do we do that? For pedagogical purposes, it's good to partition the problem into two parts. The first part is how to learn to predict, and the second part is how to use this prediction for control. This is the same partitioning that is present in the book by Sutton and Barto, so you will find the same conceptual organization that we follow here — except that in these lectures I'm gonna provide you a lot more background about the techniques that we use for the learning part. So we discussed last time that one goal is learning the value function, that is, to obtain from data, from interaction with the environment, the expected return of a given fixed policy. One possible way is to go to Monte Carlo techniques, which have advantages and disadvantages, as we discussed. But now we want to explore something different, and this something different leverages the Markovianity of the system, which allows us to write down recursion relationships for the value function; by this I mean that the value function at a given state is connected with the value function at the successor states. And we can use that linear equation for the value function in a different way: to solve it, in a sense, without knowing exactly what the coefficients are, but having access to samples of these coefficients in the form of empirical probabilities of visiting states, okay? In order to do that, we will consider a problem which is even wider than this, which is the problem of stochastic approximation. This will be broader than the problem we have at hand, because we will be able to consider nonlinear equations to be solved by this method, whereas the recursion relation is linear in the value function. So in a sense it's broader, but for pedagogical purposes, once more, just for the purpose of teaching, we will consider only the one-dimensional case, okay? I tried to convince you last time that there is an intuitive interpretation of solving for the value function as finding the zero of a function. I will get back to that at the end of this part, so we can refresh the connection and see what the similarities and differences are, if any. But for now, let's just isolate the problem that we distilled. From now on, we're gonna focus on one particular toy problem, which however contains everything we need to understand what is actually going on for our system. So what is the toy problem that I'm talking about? Well, it's actually quite simple. It starts with an even simpler problem, okay? Suppose you have some function on a domain — an interval of the real axis, in general, okay?
So there's an interval here and you have a function here. This is the x axis, and there you have a function f(x). And we are interested in finding the zero of this function: there's a point ψ here such that f(ψ) = 0, okay? To warm up, let's first discuss the situation in which f is accessible, okay? So you can query for the value of this function f, okay? The game is the following. Suppose that I act as an oracle, okay? I know what the function f is, and you can ask me the value of the function at any given point. So if you prompt me with a request — give me the value of the function at a certain point x_0 — I will return you the value of the function at that point, okay? So you don't need to know the full form of this function; you just need to know that I can return you its value, okay? Given this kind of exchange of information between you and me — and if you want to think in terms of learning, you are the agent who is trying to learn, and I am the environment which is providing you the data — f is something which is related to the environment. We will see the connection with the value function later. But the game is the following one, okay? This is problem number minus one, if you wish: the warm-up, okay? You see I'm not even introducing anything stochastic here. So how would you solve this problem? Since I want you to wake up quite early today, I'm gonna ask you this question: can you come up with simple ideas, simple algorithms, that solve this problem? How do you find this zero here, given that you can do the actions that I described previously? You can ask me the value of this function at any point, but that's all you can have. That's a lot, that's a lot of information. So how would you suggest to proceed? You can use something like the dichotomy algorithm: you take two points far apart and then you ask what the values of the function are. Then if they are positive and negative, you reduce your interval, and so on. Excellent, so this is the dichotomic approach, which works well if you are in one dimension — which is fair, because I told you we're working in one dimension — but you should always keep in the back of your mind that eventually we're gonna move this to many dimensions. Can we ask again what the assumptions are about the function? Do we know anything, or is it just... Yeah, yeah, you're right. I was introducing the idea very loosely. So the function, you can assume it is decreasing like I plotted it. It has to be monotone; let's fix ideas and say it's decreasing and its derivative is everywhere — slightly, sorry — strictly negative, okay? So it's not the case that it becomes flat at some point. It's what you would expect in case you have just one zero, okay? These are the kind of conditions that guarantee the problem has just a single zero in your interval of interest. Is that what you were asking for? Yes. Do you want me to write? I mean, we can write the conditions here. The condition is that f′(x) is bounded above by some −G, with capital G positive, okay? It means that if I take the slope of this, it never becomes flat; it's always going down by some finite amount. And this is enough to guarantee that there is a single zero in the interval of interest. And I would ask for two points and start to estimate what the gradient is, in order to move in that direction. Okay, right?
So one possibility is to ask for the gradient, but I told you this is not in the rules: I can give you the value of the function. Yes, of course, but then I can approximate it with the... This is one possibility, okay? You can ask me the value at one point and the value at a neighboring point, and from those compute an approximation of the gradient. But there's something simpler. Well, if the gradient... Well, if the derivative is negative, we can just move in that direction until we hit zero or go below it, and then go back, basically. Okay, we're sort of approaching it. Can you elaborate a little, because there was a lot in that sentence. How would it work? Let's take a step. Suppose we start with x equal to x_0. Then I give you the value of the function there. What do you do next? Well, I could just take the value there. Then I don't know which the direction is — whether it's increasing or decreasing. So I just move in one direction and then I see if the value is higher or lower. Let's say the function is decreasing, like we said here; it's going down like this one. Okay, then I would look at the current value, move ahead one step, say, and see if I am still above or below zero. And if I'm above, I still go... so if it's still positive, I may be accelerating. I could accelerate, or just say that I go ahead by one step, because we want... Okay, okay, okay. We are approaching the point I wanted to get to. So I basically march down until I hit ground level. Okay, yeah. That's probably the simplest algorithm, which I'm gonna write down in a more obvious form, if you wish, which allows us to fix the ideas about exactly how to implement it. The idea is that your next guess for the point x at step k+1 is the point x at step k plus some parameter α — which we're gonna fix in a second, or at least discuss — times the value of the function at the point that you sample: x_{k+1} = x_k + α f(x_k), okay? So you see how the game works. You choose a point x_k, starting with k = 0, and I give you the value of the function, okay? Then if the function is positive, this means that you have to step to the right; if the function is negative, it means that you have to step to the left, okay? And this α here goes under different names — step size — but we're gonna call it the learning rate, okay? So do you agree on what this algorithm is doing? Start at x_0, compute f(x_0) here, it's positive, so we're gonna take a step in this direction, which is equal to the value of the function times the parameter α. Clearly, as we approach the zero, the steps become smaller, because the value of the function becomes smaller, okay? And in this case, you can prove that this algorithm converges exponentially fast to the zero, okay? It's a very simple algorithm. It doesn't require computing derivatives or approximating them; it's just a way to approach the zero, okay? So far so good.
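Before we move on, here is a minimal runnable sketch of this noiseless game. The oracle f below is a made-up example (a straight line with slope −2 and zero at ψ = 1), and the learning rate is a hand-picked illustrative value — a sketch, not anything prescribed by the lecture:

```python
# Minimal sketch of the noiseless zero-finding iteration x_{k+1} = x_k + alpha * f(x_k).
# The oracle f and the constants below are illustrative choices.

def f(x):
    # A strictly decreasing function with f'(x) = -2 <= -G everywhere; zero at psi = 1.
    return -2.0 * (x - 1.0)

alpha = 0.1   # fixed learning rate (step size)
x = -3.0      # initial guess x_0

for k in range(50):
    x = x + alpha * f(x)  # step right if f(x) > 0, left if f(x) < 0

print(x)  # approaches psi = 1 exponentially fast: the error shrinks by (1 - 2*alpha) per step
```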
And now this is a good point to make an important connection with something you might be familiar with. And the connection comes here — just a remark. Suppose that our f function is actually the derivative of another function, capital F, okay? So what does it mean in terms of the function capital F? Well, you see, the derivative of capital F is positive here, it's zero at this point ψ, and it goes down after. So if I have to plot the function capital F, it's gonna be something like this. This is my capital F(x). And the condition here becomes: the second derivative of F is strictly negative. So what does the algorithm become? Well, small f is F′, so for my capital F function the algorithm becomes x_{k+1} = x_k + α F′(x_k). But this is gradient descent — gradient ascent, sorry, because we're looking for the maximum, okay? So this problem of finding a zero is just a broader perspective on gradient ascent, if you wish. It includes gradient ascent as a special case, but it also covers other kinds of problems which are not necessarily gradients, okay? Because in many dimensions, you can ask for the zero of a set of functions, the solution of a set of equations, and these equations are not necessarily the gradient of a potential function, okay? They might be vector fields which are not gradients at all. So it's a broader question which encompasses the problem of gradient ascent, okay? So you see that this algorithm of finding zeros is basically gradient ascent in disguise. In one dimension the two things are indistinguishable, because every function is the derivative of some other function, okay? Everything clear so far? Good. Then we have to move to another problem, which is the one that interests us. And I'm gonna formulate the problem as follows. We're gonna play by almost the same rules, in the sense that you can ask me for the value of the function at any given point, but what you will get in return is a noisy version of that function: the true function plus some noise. And the distribution of this noise you don't know, okay? What this amounts to is that now we are playing the game by which, if you ask me the value of the function at x_0, I will not return you the point that you see here in the middle, but I will return you something like a point here, or maybe a point here or here or here, depending on the time at which you ask me the question, or whatever, okay? We will make some technical assumptions on what this noise is, but the question is always the same: if I return you noisy estimates of the value of the function, will you be able to find the zero also in this case? No — Idris says no. Okay, I'm gonna show you that you can. You can, if the algorithm from the previous problem is adapted... I mean, the problem is... Yeah, just one at a time. So let's hear from Idris first, and then Asha — Haya, just a second, please. No, because you said you don't know the distribution of the noise. And I will show you that you don't need that. Okay. Okay. You're not lucky then. Okay. Haya, what were you saying? Sorry. I was saying, if we learn from the value, if we estimate the value and it's near the real value, then we can account for the noise. Okay, that's what we hope for, but we will see that it does not always work. So both of you are right, okay? In the sense that it's not always working — it works sometimes — but you don't need to know what the noise is, okay? So we're gonna find the conditions under which this kind of algorithm can work. And the basic idea is that, in fact, we would like to use the same kind of algorithm as here, but now replacing the true values of the function with their noisy estimates, okay? So if we go back to the interpretation in terms of gradients, the stochastic equivalent of what we're trying to do is the same game.
There, it's like climbing a function with noisy estimates of the gradients; here we want to find the zero of a function with noisy estimates of the function itself, okay? Is the layout clear to everybody? Do we have questions before we dive into the math? The question is whether we can only get close to the zero, or we can actually reach the optimum. Excellent, excellent. So we're gonna show today that we get to the zero with probability one, which is the best you can do in a stochastic setting. So this means arbitrarily close. As you will see — and we can have an informal discussion already, before we put the math in — what kind of problem could arise, when there is noise, if we try to apply the same kind of algorithm here, okay? So we are trying to use this algorithm, again replacing the true values by noisy estimates: just like using this plus some noise at step k, whose properties I have not specified so far. Let's just say it behaves well, okay? In the sense that it has zero mean and finite variance; whatever else, I don't care, okay? I will also assume that it has some technical properties, like the martingale property or whatever, but let's say it's a reasonable choice for the noise — it need not be Gaussian, need not be anything specific, okay? All the rest is left to the environment to decide. So what can go wrong with an algorithm like that? Okay. Remember there are two things that we have to do. First, we start somewhere; we don't know where the zero is. So we start here, or we could start at any other point of our interval — which could be the whole real line, if we have a function which is everywhere decreasing. And for instance, in our value-function problem, since the problem is linear, the domain of the function is infinite, okay? So the first thing that we have to do with this algorithm is get close to the zero as fast as we can, okay? If we start very far away on the left, somewhere here, we would like to take large steps to get close to the zero — that is one thing that we want to do. But when we take large steps and there is noise, then something nasty can happen when you are close to the zero, okay? Because we might start flipping from one side to the other, and since there is always noise, we never get arbitrarily close to the point, okay? So you see that when there is noise, there is some tension between having large learning steps and small learning steps. In anticipation of what we will show later: this algorithm can be made to work if you choose your learning rate appropriately, okay? This will become the key ingredient. The same algorithm that we use in the deterministic situation, where I give you the true value of the function without noise, can be applied in the presence of noise if we carefully control the learning rates, in a way that I will make clear hopefully soon, okay? Any additional question before we start? Okay, you can stop me at any time if anything becomes fuzzy or unclear. All right, so now we're ready to restate the problem in a more formal setting, okay? I'm gonna repeat some of the assumptions that I'm making. I have a function f on some domain D, which is an interval of the real axis for us. And my function satisfies f′(x) ≤ −G everywhere, with G positive. And I'm therefore setting myself in the conditions where, of course, at the two extremes of the interval, the function is positive on one side and negative on the other.
So I'm assured that there is a point ψ such that f(ψ) = 0, and the condition on the derivative tells me that this point is unique, okay? So there is one zero. Basically I'm just restating what I pictorially drew here. This is my underlying f, but f itself is not observable. What we can observe is in fact a random variable φ such that the expected value of this random variable at a point x is f(x): E[φ(x)] = f(x), okay? So this φ here is an unbiased estimate of f at any point, okay? Which is another way of writing, informally, that what I return you is φ(x) = f(x) + η, where this η has zero mean and finite variance, bounded by some constant — of which I don't want to use the names that I'll need later, so let's call it capital C, whatever, okay? This is all I'm asking at the moment. This unbiasedness condition can be relaxed, okay? But we're not gonna do it, because the calculations become even heavier than they already are. If the estimate is slightly biased, or asymptotically unbiased, all the calculations work as well. But we will ignore this part for our class, okay? Okay? So the statement that I'm gonna prove later concerns the sequence of points generated by the algorithm x_{k+1} = x_k + α_k φ(x_k), okay? That is, using the random estimate of my function f at the point x_k — which is what the oracle, myself, the environment, will return to you if you ask for it. And you see that now I've inserted the key ingredient: a dependence of the learning rate on the time step, the iteration step, and we are gonna play heavily on this, okay? Professor, I'm a little confused about the difference between gradient descent and stochastic gradient descent. Well, what is the stochastic part? The stochastic part is this: in gradient descent, if you had stochastic estimates for the gradient — so if you replaced the term here with F′(x) plus noise — that would be stochastic gradient descent. Okay, okay, thank you. At a very superficial level, but that's the substance of the thing, okay? All right. So, as you see, I'm introducing a dependence of the learning rate on the time step, which, qualitatively speaking — and we will see how this plays out — means that I would like to have large steps at the beginning, when I'm possibly very far away from my target point ψ, and, as I approach it, that is, as the number of iterations increases, I would like to reduce these steps in order to kill the noise, okay? So there is — and we will make it explicit in the end — a bias-variance trade-off in this problem. At the beginning, we want to remove the bias, and the bias comes from the fact that we have to start somewhere. We have to make some assumption about where our zero is; that is, we guess some x_0 at the beginning. This is my initialization here: choose x_0 in D, indeed. In general you might have no information and just start somewhere, or you might have a guess; but anyway, any choice introduces some bias at the beginning. And you want to remove this bias — that is, you want to approach the zero — but you also want to keep the variance under control. And this is accomplished by scheduling the learning rates: the learning steps, or rates, are scheduled in time.
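To make the game concrete, here is a minimal runnable sketch of this iteration. The oracle, the noise level, and the 1/(k+1) schedule are all illustrative choices of mine (the schedule anticipates the conditions we are about to state), not anything prescribed by the theorem:

```python
import random

# Minimal sketch of the Robbins-Monro iteration x_{k+1} = x_k + alpha_k * phi(x_k),
# where phi is a noisy, unbiased oracle for f. Function, noise level, and schedule
# below are illustrative choices.

def phi(x):
    # Unbiased noisy oracle: f(x) = -2*(x - 1) plus zero-mean, bounded-variance noise.
    return -2.0 * (x - 1.0) + random.gauss(0.0, 1.0)

x = -3.0  # initial guess x_0
for k in range(100_000):
    alpha_k = 1.0 / (k + 1)   # satisfies sum(alpha_k) = inf, sum(alpha_k**2) < inf
    x = x + alpha_k * phi(x)

print(x)  # converges to psi = 1 with probability one as the iterations grow
```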
"Scheduled" is technical jargon meaning that we are gonna find some dependence on the iteration step that accomplishes this delicate — actually, as we'll see, quite robust — balance: the trade-off between bias and variance. Okay, so I'm gonna make the statement now. And the statement is that the right way to choose these learning rates is to ask that the series Σ_{k=0}^∞ α_k diverges, and that the series Σ_{k=0}^∞ α_k² converges, okay? These two conditions are sufficient, okay? Sufficient for what? They imply that my sequence of points x_k, for k tending to infinity, converges to ψ, my zero, with probability one. Okay, so that's the mathematical statement of stochastic approximation. You have an algorithm, which by the way is called the Robbins-Monro algorithm — Robbins and Monro are two people, and the paper is from 1951, okay? Just to put everything in historical perspective: Bellman was working on optimization problems at the same time as Robbins and Monro were working on stochastic approximation, and the two things merged much later, in the '80s and '90s. But these two objects were well known much before that, okay? So is the statement clear? We have an algorithm, we have conditions on the learning rates, and we want to prove that this sequence of points converges almost surely to our zero. Fine. So — yeah, please. Can I ask a question? I didn't understand why taking small steps reduces our variance near the zero of the function. Okay. Because if the noise has finite variance, the output given by the oracle is always noisy, it's wrong — also if we take small steps. Okay, fine. But look at what happens when you approach the zero. Let's zoom in on this area somewhere here and see what happens. This is your true function, which you don't know. And then you query around here and you get some estimate here, okay? Now, if your α is large, you will very likely make a step across the zero, okay? And then you get some estimate here. For instance, it may well happen here — since your distribution has finite variance — that you get a positive value for your random estimate even if you are on the right of the zero. Okay, this may happen. And if this happens, it's gonna push you farther away from the point. Okay, so how do you deal with this kind of problem? Now suppose that you make very, very small steps. If you make very, very small steps, this means that from one time to the next you're not moving much, so you're sampling at a point which is very close. And therefore your variance is reducing, reducing, reducing, and after many, many trials you can be reasonably sure that your value here will be negative, okay? So you see: if you don't move much, you can kill your variance effectively; but if you move a lot, then it's gonna be a problem, okay? It's just like when you want to take a photo: you have to stand very still, and if you try to take a photo while you move, it's complicated, right? That sort of analogy. Is that any clearer? Okay, so it's sort of: if we continue to sample that particular point, in the end, by taking sample after sample after sample, we really get the real value? Yeah, exactly.
So, I mean, if you don't move — if you ask me 100 times the value of the function at the same point — you're gonna kill the variance very effectively, right? But then you're not making any step; it's just like having α = 0 here, if you keep asking me repeatedly. So this is very good from the viewpoint of killing the variance, but it's very bad from the viewpoint of killing the bias, that is, of moving towards the zero. On the contrary, if your α is large, this is very good for moving quickly towards the zero, but then the variance is not killed as effectively. So you have to find the right balance. Is that any clearer now? Yes, thanks. Sure, no problem. Okay. Very good. So before we start, just one more clarification, if you wish. You might ask about these two conditions: how hard are they to satisfy? Is there even a way to fulfill them? And actually, in fact, it's not particularly difficult. For example, take any sequence α_k which, for large k — for k tending to infinity — behaves like k to the minus some exponent β, with β strictly greater than one half and at most one, okay? This satisfies the two conditions: the first series diverges because β ≤ 1, and the second series converges because 2β > 1, okay? So anything that asymptotically decays as a power law with an exponent in that range works, as you can easily check. So this is not something that is very, very difficult to achieve, and actually there are very many ways of achieving it. It's not particularly restrictive. But remember that these are sufficient conditions, okay? So you might have problems for which you can find schedules outside this range, and for which things work as well, okay? I hope we'll manage to get to the point where I can give you examples, but if not, bear in mind that these are sufficient conditions but not necessary ones. There might be other choices that these conditions do not include. But nevertheless, these are quite enough. Anything else? Please. I see. I'm sorry — I had another observation, but go ahead; one at a time. Okay, yes, please. No, just quickly: can we go through the assumptions, the hypotheses of the theorem, again? Sorry, I didn't get to... Can we see the conditions of the theorem again for a second? This one? The assumptions, the assumptions on f and on the noise. All right. Yeah. Which one are you concerned about? Okay, so — no, I just missed a detail. So we don't need anything in particular about the distribution of the noise; we just require that the function is differentiable, is that correct? The actual function must be differentiable, yeah. And then we bound its derivative. And we bound its derivative, yeah. Thank you, sorry. No problem. Please. Yeah, I had another observation. I was thinking: basically, since we have noisy estimates of f, we are never going to be able to overcome that uncertainty in locating the zero, basically. Is that correct? I mean, we can never get more precise than what the noise allows us to get. Well, my point is exactly that by properly choosing the learning rates, you can actually overcome the noise asymptotically.
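To see this tension numerically, here is a toy comparison, with the same made-up oracle as before; the constants are illustrative. A constant step leaves a variance floor around the zero, while a Robbins-Monro schedule averages the noise away:

```python
import random

# Toy comparison: constant step size vs Robbins-Monro schedule, same noisy oracle
# as before (illustrative constants; the true zero is psi = 1).

def phi(x):
    return -2.0 * (x - 1.0) + random.gauss(0.0, 1.0)

def run(schedule, steps=100_000):
    x = -3.0
    for k in range(steps):
        x += schedule(k) * phi(x)
    return x

print(run(lambda k: 0.1))            # constant alpha: keeps fluctuating around psi
print(run(lambda k: 1.0 / (k + 1)))  # decaying alpha: the noise is averaged out
```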
So for a large number of iterations, eventually — and that's exactly the statement — you converge to the true zero with probability one. Okay. No, because what I was thinking was: if we can make some assumption, as in this case, about the noise — like that it has fixed variance, so basically we always get a noisy estimate with the same maximum error from the actual point... We're gonna use this. I mean, we're gonna use the fact that the variance is bounded. Okay, okay. So basically we know that wherever we are, if we ask the oracle enough times for an estimate at the same point, eventually we're going to figure out what the variance is, and so we are going to know how far we may expect to be from the actual zero. That is guaranteed; only, it is a poor way of finding the zero, in the sense that you spend a lot of time at the same point and you don't move fast enough, okay? Yeah, well, what I was thinking was — maybe it's just a thought experiment — that we can do this process of estimating the variance anywhere, so it doesn't matter where we are. Very good, very good. So this allows me — yes, let me just barge in, because this gives me the opportunity to discuss one thing that I wanted to skip over, but now, since you asked: you could have solved this problem another way, okay? Which is, for instance: you choose a certain set of points, the yellow points here — sorry, they must be in the domain — and you repeatedly ask me for, I don't know, 100 estimates at this point here, okay? So you will have a cloud here and a cloud there. And then, since you have 100 points, you say: okay, my error is gonna be 1% around this value of the function, right? And so I could construct an approximation by uniformly sampling my domain and then trying to draw the line, find the intercept, and go about it like this. Is that anything close to what you have in mind? Yeah, basically what I was thinking is something like this: maybe before even starting to move, at the beginning of the algorithm, I don't move at all; I just query the oracle at the current position, so I get an idea of what my interval of uncertainty is. And once I have these estimates — so basically I know an interval, how big the uncertainty of my positioning in the space is — only once I have an estimate of this error that I am confident enough with, then I start moving, taking this error into account, and each time I move the center of the interval, basically. It's clear to me, and I think it's also clear to all the others; what you're actually describing here is explore-then-commit. You do all of the exploration first, and then you go down your approximated gradient. Yeah, basically I'm turning the knob completely toward exploration at the beginning, without moving at all, just looking around, and then going forward exploiting. And now I will give you two reasons why this is a loser's strategy. The first one is that if you live in many dimensions, this attempt at covering the space is hopeless. Yeah, sure, sure. Second point: if you do not adapt the grid over which you are looking at your function, at some point, when you get close to the zero, you will be in a situation where there is a shortage of information, because your two points will be wide apart, and here in the middle you cannot overcome the noise. So you will need to sample more and more densely close to the zero.
Do you agree with that? Yeah, yeah. So the reason why we use this algorithm is exactly these two reasons. First, because it allows you to search rather than to sample first, okay? You sample only the points that you need for the search, which is, again, one instantiation of the balance between exploration and exploitation. And second, you do that adaptively — in a relatively stupid way, admittedly, because you adapt only with respect to the number of steps and not with respect to the value of the function. So there are better ways of doing this, but still, it is a way in which you sample differently depending on where you are, whereas in the approach that you suggest — which is perfectly honest — at the beginning you don't do that. So you see the shortcomings. Yeah, okay, I think I understand, yes. Okay, great. Very good. That's all excellent. So I see you're alive, but we have to start with the math for real now, okay? So let's do it. I think this is gonna take us some 10 to 15 minutes, but we can slow it down almost to zero if you get stuck anywhere. It's not gonna be very complicated, right? So bear with me. We're gonna give a proof which is mathematically reasonable, but not in a fully rigorous, theorem-based setting, okay? Still, it's a proper argument, okay? So first of all, we want to show that the convergence holds with probability one, okay? Now, working directly with probability-one statements about a sequence is mathematically awkward. Luckily enough, there is another condition which is stronger than convergence with probability one, and therefore implies it. So what we are actually gonna show is this: consider the expectation of (X_N − ψ)² — I'm gonna switch between k and N, it doesn't matter, it's just one step of my algorithm. If this expectation goes to zero as N goes to infinity, this implies that X_N goes to ψ with probability one. Do you agree with that? If the mean square error — which is this object — between the true zero and my estimate goes to zero, it's impossible that there is any probability left outside, okay? You can prove this by contradiction: if there is some finite probability that my X_N eventually is not ψ, then the mean square error cannot go to zero, okay? Since I don't hear any objection, I think you're gonna buy this — and you should. So what we're gonna focus on is this statement, which in turn implies that with probability one we converge to that. And the way we're gonna prove it is by means of a relatively brutal calculation, okay? So — let me not call this "the proof", because I don't ever want a mathematician looking at this video and insulting me for having called this a proper proof; it's an argument that captures quite closely the necessary steps. The way we go is: we take the expected value of (X_{n+1} − ψ)², okay? This is the same object as before, just one step later. And now, by the definition of the algorithm, I'm gonna replace X_{n+1} inside, okay? So this is just, by definition, equal to the expectation of (X_n − ψ + α_n φ(X_n))². We're doing nothing here, just substituting the definition of the new step in terms of the previous step.
And now, again doing very little, I'm gonna just unpack the square, so that I get out the square of the difference at the previous time plus all the other terms, okay? I'm just expanding; nothing serious is happening here: E[(X_{n+1} − ψ)²] = E[(X_n − ψ)²] + 2α_n E[(X_n − ψ) φ(X_n)] + α_n² E[φ(X_n)²]. The α_n is a number, a positive real number, so it goes out of the expectation — there's nothing random in it. So far so good. Now let's recall: what does this expectation mean? What is random here? Remember, every time that you pick a value of x, I return you a stochastic estimate, okay? So this expectation is over all the randomness that is there, over the whole sequence of noises at all times: at step n+1, there have been n+1 noises before that contributed to this expectation. Okay, very good. Now we're gonna do something for real here, which is the following. Let's focus on the middle term, okay? I'm gonna highlight this middle term, and we're gonna work it out a little bit. So this expectation of the product — what is it? Remember that φ(X_n) is the value of the function at X_n plus the noise at step n. But this noise is something fresh, something that is refreshed every time, okay? It doesn't depend on the actual X_n that you chose. This is the martingale property that I was talking about at the beginning — which is technical, but basically means you can replace it by the assumption that the noises are independent at every draw, okay? And if the noises are independent at every draw, the X_n − ψ part here is totally independent of the noise which is present there. So I can expand this, take the averages, and remember that the average of the noise is zero, because φ is unbiased. So this object I can actually replace by E[(X_n − ψ) f(X_n)]. Well, I averaged out the last noise. There still is an expectation, because the point X_n depends on all the previous noises that I had, okay? I am there because I had a sequence of noises which affected my algorithm, and therefore there are still some averages to be done; but I have just averaged out the last noise that I've seen. Okay, now I'm gonna work a little bit more on this term, with a very, very simple manipulation: I'm gonna write it as (X_n − ψ)² times f(X_n)/(X_n − ψ). Okay, here you should be alert, because I'm dividing by something which could be zero, okay? So this is potentially dangerous — but of course this, landing exactly on top of your zero, happens with probability zero. At every step of the algorithm, the probability of falling exactly on a given real number is zero. So this term is divergent with probability zero, and we shouldn't be particularly worried about it. Again, this is a bit of a patch — this is where the physicist comes in and does the bad mathematical stuff — but you can fix it properly, mathematically; just stay with me, okay? Good. Now, why do we want to do this? Well, because of this ratio that I have here. What is it? If I go back to my picture above — no, I'm gonna redraw it, because it's gonna be a mess.
So let's redraw this object here. Remember, the function always has a derivative like this. For every point X_n, this is my value f(X_n), and this is ψ, okay? So this ratio between the value of the function and the separation here is a negative object — it's always negative: if I am on the left, f is positive but the distance X_n − ψ is negative; on the other hand, if I'm on the right, my f is negative and my X_n − ψ is positive, okay? So this object is always strictly less than zero. And in fact you can make a stronger statement — sorry, you cannot read this, so let me erase — because the condition that f′(x) is everywhere smaller than −G implies that f(x)/(x − ψ) is also smaller than −G for all points in D. You may think of this object as a discrete derivative, and it is necessarily bounded by the true derivative, okay? Again, this would require a couple of passages to be proven mathematically, but the intuition is quite obvious: if the curve is everywhere at least this steep, then any finite difference is at least that steep as well. So why do we use this? Because it allows us to say that this term is in fact not only strictly smaller than zero, but actually smaller than −G, which in turn is smaller than zero — so we can bound it. And if I combine all this together, it means that the average is bounded: E[(X_n − ψ) f(X_n)] ≤ −G E[(X_n − ψ)²], where capital G is positive, remember — it's a lower bound on the modulus of the derivative. And that's it, okay? So why did I want to do this? Well, because by this trick I got rid of the φ in the middle and got back to something of the same kind as the first term, okay? I'm trying to boil everything down to these mean square errors, so that I can have a recursion and use it to prove that the mean square errors actually go to zero. I hope you're still with me.
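The "couple of passages" are just the mean value theorem; a sketch, in the same notation:

```latex
% Why f'(x) <= -G implies f(x)/(x - psi) <= -G (mean value theorem):
f(x) = f(x) - f(\psi) = f'(c)\,(x - \psi) \quad \text{for some } c \text{ between } x \text{ and } \psi,
\qquad\Longrightarrow\qquad
\frac{f(x)}{x - \psi} = f'(c) \le -G .
```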
Okay, so let's plug this inequality — the one in orange, or yellow — back in here. We restart from here; let's call this equation one. Rewriting it, the left-hand side one step later is bounded as E[(X_{n+1} − ψ)²] ≤ E[(X_n − ψ)²] − 2α_n G E[(X_n − ψ)²] + α_n² E[φ(X_n)²]. The first part was exact; for the middle term I used my inequality — there was an α_n there, remember; and then I have the last term with α_n². Okay? So, first part done. Now, second, we're going to take care of this last term. This object is the square of my estimate, okay? But the expected value of a square is the expected value squared plus the variance — the second moment is just the first moment squared plus the variance. And now both of these things are bounded, okay? Because my function is finite over the whole domain — this is my f(X_n)², which is bounded — and the variance is bounded by hypothesis. So I can say that this term overall is smaller than or equal to some capital H. And this bound is just the consequence of the finiteness of the mean and the variance of my estimates, as it should be. Are you okay with this? Fine. If we can do that, then we can add another step to our inequality. So this is my equation two, and from it I get the further inequality — let's rewrite it again for clarity: E[(X_{n+1} − ψ)²] ≤ E[(X_n − ψ)²] − 2Gα_n E[(X_n − ψ)²] — and so far I'm just copying — plus the last piece, which is + α_n² H. Okay, so far so good. We have a few steps left. So, do you want to take a break here, or do you want to go through? It will take at least some five minutes more. You want to do it right now? Okay. Let's finish it. I prefer to finish here. Okay, let's go. Let's go. Good. Fine. So the next step is something quite simple first, so let me do it inline, that is, without rewriting the formula. What I'm gonna do is just move this term here to the left-hand side, okay? I'm gonna do it in a way that I hope you will tolerate: I just put the minus here and then I get this. Okay, have you seen what happened? Nothing serious — I just moved this to the other side. And why do I want to do that? Because this term becomes the difference of the mean square errors at two consecutive steps. So what I'm gonna do now is take a sum, n going from zero to some capital N, on both sides of the inequality, okay? Why do I want to do that? Because the sum will be a telescopic sum, okay? When I sum the differences, all the terms in the middle go away, and I am left with just the last and the first term. So, summing n from zero to N on both sides: if I do the telescopic sum on the left-hand side, what I get is E[(X_{N+1} − ψ)²] − E[(X_0 − ψ)²] ≤ −2G Σ_{n=0}^{N} α_n E[(X_n − ψ)²] + H times the sum, n going from zero to infinity, of α_n². Okay, and now you see the first point where our assumptions start to matter, okay? Because — Sorry, isn't that a capital N, not infinity? Yeah, you're totally right. All right, okay — I was stepping ahead; it's not time yet to take the limit. Thank you. Sure. Okay, excellent. But you can already get an insight into why we asked for that condition — the one on the convergence of the sum of the α squared. It is what will make this term on the right meaningful: if it were diverging, we would have a trivial inequality, something less than infinity, which we really don't care about, okay? So this is just to give you a sense of where these conditions emerge naturally in this kind of computation. Very good. So, very few steps now — just a couple of them, to put things into shape. What we have to do now is just rearrange this sum a little bit. So I'm gonna take this second term here. Is that what we need to do? Let me check that everything is in place.
Because if I make a mistake now, it's gonna turn out to be a mess. Yeah, that's fine. Yeah, right, good. So what we're gonna do now is take this bit here and move it to the left-hand side. This is, again, just a little bit of rewriting and shuffling, nothing serious — but I also divide by this 2G, okay? So let's write it, and if you don't see it, you can complain openly: Σ_{n=0}^{N} α_n E[(X_n − ψ)²] ≤ (1/2G) times what was on the left-hand side, which now goes on the right-hand side with a minus — that is, E[(X_0 − ψ)²] − E[(X_{N+1} − ψ)²] — and then I still have this plus H Σ_{n=0}^{N} α_n². Okay? So this is a minimal rearrangement: just taking one thing to the other side and dividing everything by 2G, which is strictly positive. Any question? No — there also is the one over 2G for the H term, there should be. Thank you, right. Very good. Now, this term here, E[(X_{N+1} − ψ)²], is positive, okay? And there's a minus in front of it, so we can remove it from the inequality and bound further: Σ_{n=0}^{N} α_n E[(X_n − ψ)²] ≤ (1/2G) E[(X_0 − ψ)²] + (H/2G) Σ_{n=0}^{N} α_n². Okay? And now we are done, essentially, because now we just have to think and use our hypotheses. Remember, one of our hypotheses was the convergence of this object, okay? So when we take the limit of N going to infinity, this sum converges, okay? Basically this means that the whole right-hand side is bounded from above; it cannot go to infinity. But let's look at what we have on the left-hand side, okay? The comparison is between what we have on the left-hand side and what we have on the right-hand side. And by my hypothesis, everything on the right-hand side must be smaller than some constant, because of the convergence of the series, okay? For any capital N: it is a series of positive numbers, so if it converges, there is an upper bound to it. And the first term is just how far I am from my initial point, okay? So the only condition we need to impose is that I don't start infinitely far away from my zero; if my interval is finite, this is fine as well, okay? So the left-hand side is bounded by two terms: one is the sum of the squared learning rates, and the other is how far I start from the zero, okay? But if this is bounded — now I use my other hypothesis, which is that the sum Σ_{n=0}^{∞} α_n diverges. If my sum of the learning steps diverges, it is not possible for the left-hand side to stay finite unless the mean square errors go to zero. You see: I have something which is finite on the right, and I have a series of terms on the left in which the α_n part alone does not go down fast enough — it diverges in the sum. So if, by contradiction, you assume that this mean square error tends to a finite positive value, that would mean the left-hand side is infinite. It doesn't work, okay? These things cannot hold together unless E[(X_n − ψ)²] goes to zero. And this implies X_n going to ψ with probability one. So this is the essence of the mathematics.
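Let me collect the whole chain in one place — this is just the blackboard computation transcribed compactly, nothing new:

```latex
% Compact recap of the convergence argument (same steps as on the board):
\begin{aligned}
\mathbb{E}[(X_{n+1}-\psi)^2]
 &= \mathbb{E}[(X_n-\psi)^2] + 2\alpha_n\,\mathbb{E}[(X_n-\psi)\,\varphi(X_n)]
    + \alpha_n^2\,\mathbb{E}[\varphi(X_n)^2] \\
 &\le \mathbb{E}[(X_n-\psi)^2] - 2\alpha_n G\,\mathbb{E}[(X_n-\psi)^2] + \alpha_n^2 H .
\end{aligned}
% Telescoping from n = 0 to N and rearranging:
2G \sum_{n=0}^{N} \alpha_n\, \mathbb{E}[(X_n-\psi)^2]
 \;\le\; \mathbb{E}[(X_0-\psi)^2] + H \sum_{n=0}^{N} \alpha_n^2 .
% Since sum alpha_n^2 converges, the right-hand side stays bounded as N grows;
% since sum alpha_n diverges, this forces E[(X_n - psi)^2] -> 0, hence X_n -> psi a.s.
```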
Now we take a well-deserved break, and then we think about what we have actually been doing with this proof, okay? I think we can start again at 25. Okay, thank you. See you later. Okay, back we are. So let's try to wrap it up in the remaining half hour that we have, because we have to complete our understanding of what we have done, move to the multidimensional case, and then come back to the value function, okay? So it's gonna be a ride through these different concepts. The first thing is to realize what this algorithm is actually accomplishing, okay? The statement is: we have, in a sense, proven that stochastic approximation converges with probability one to the zero of the function — which means that we have also proven that, in one dimension, stochastic gradient ascent converges with probability one to the maximum of a concave function, given the scheduling of the rates that we have chosen. So before going to the multidimensional case, let's first discuss very quickly what is actually happening. This is just a remark. First remark: let's consider a slightly different situation, in which the number of steps, capital N, is fixed a priori. It's a slightly different version of the algorithm: you don't go on forever until you are close enough to your zero, but you say, I have a budget of iterations, capital N, and I want to keep that fixed. And we use a fixed α as well, okay? So there is no scheduling here. It's clear that this is not gonna have exactly the same performance as the other case, because we cannot go on forever and we cannot drive the error all the way down. But we can repeat the calculations that we just did — they are actually slightly simpler. And the bottom line is that, if you repeat them — which I think is also a good exercise to do before going back over the proof of the theorem, because everything is more straightforward and then you understand the little subtleties — what you get is: Σ_{n=0}^{N} E[(x_n − ψ)²] ≤ E[(x_0 − ψ)²]/(2Gα) + (H/2G) α N, okay? It's exactly the same kind of calculation as above, only that now, since the α's are constant, they come out of the sums, and everything is simpler. So this object on the left — what is it? It's the cumulative squared error, okay? At every step, you are away from your zero by x_n − ψ; you square it, you average it, and you sum. It's a measure of the regret of your algorithm: how much error you have accumulated over time. And what this calculation tells you is that this object is bounded by two other terms, okay? The first one is related to the initial guess: it is bounded, if you wish, by the size of the interval squared, divided by 2Gα. So this is the term which is sensitive to the initial bias — the closer your guess is to the actual zero, the smaller it will be, okay? And you see that this term features the learning rate in the denominator. You want to make the right-hand side small, because then your left-hand side — the regret, the cumulative error of your algorithm — is small too, which is something that you would like to have.
But in order to make this first term small, you would like to have a good guess, which makes the numerator small, and a large α, a large learning step, because this makes you run towards the zero faster, okay? The second term, on the contrary, features the variance: it depends on the variance of your noise. So one way to make it small is to have very good estimates, so that the variance is small — but this doesn't depend on you; you get the variance that you get. What you can act on here is α: having a small α means that you make many, many steps close to each other, so you reduce the error due to the variance. But of course this term also grows with the horizon — careful, this is not small n, it's capital N, okay? So you see, this is essentially what these algorithms are trying to achieve: to balance the initial bias and the variance, the two contributions to the cumulative error. And in this particular case, if you keep a fixed α, you realize that there is an optimal choice of α. And this optimal α — you can derive it; you just minimize the right-hand side with respect to α, and it's a simple quadratic problem. I won't write out the constants, because I would certainly mess them up and I don't have them on my script, but the optimal α goes essentially like N to the power minus one half, times the square root of D over H, where D is the expectation of (x_0 − ψ)², the initial squared distance. (I first wrote the ratio the other way around; you have to take the opposite, D over H.) The basic idea, as you see, is straightforward: if your D is large — if your distance from the initial point is large — then you want a large α; if your variance is large, then you want a small α. And this is the compromise. And this compromise sits exactly at the boundary of the conditions imposed by the Robbins-Monro theorem — you see, it's just at one of the two boundaries for the exponent. Robbins-Monro actually excludes it, but Robbins-Monro wants to achieve something stronger than this second setting, in which I have a fixed time horizon. Still, this is another way of looking at the theorem, from another angle, which shows you exactly what this kind of scheduling is trying to accomplish: move fast away from your initial point, and then slow down close to the target in order to kill the variance. This is a good way of balancing exploration with exploitation. Fine. Any questions so far? Okay. As a side exercise — this is one remark that I'm not going to work out, but for those of you who are interested: if you make additional assumptions on the function — remember there was an assumption on f′; if you make further assumptions, something on f″... sorry, that's not the statement: if you ask f to be decreasing and concave — then you can make this convergence even faster, okay? But the price to pay is that you have to make assumptions about the function, okay?
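Spelling out the optimization described a moment ago — the bound and the minimization, with the constants as I reconstructed them on the board:

```latex
% Fixed step size alpha, fixed horizon N: the telescoping gives the regret bound
\sum_{n=0}^{N} \mathbb{E}[(x_n-\psi)^2]
 \;\le\; \frac{D}{2G\alpha} \;+\; \frac{H\,\alpha\,N}{2G},
 \qquad D := \mathbb{E}[(x_0-\psi)^2].
% Minimizing the right-hand side over alpha (set its derivative to zero):
\alpha^\ast = \sqrt{\frac{D}{H\,N}} \;\propto\; N^{-1/2},
% i.e. exactly the beta = 1/2 boundary that the Robbins-Monro conditions exclude.
```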
So this is just to tell you that this trade-off may be balanced differently depending on how much you know about the function, okay? I don't have time to go through this, but if you're interested, I can give you references. All right, because now what we want to do, to close, is just two steps. First, go to the multidimensional case, okay? This is contained in a paper by Blum from 1954, okay? So this is also classical stuff, but I want you to reason just a little bit about what the problem becomes, because this is important when you want to apply this to the value function, okay? Remember, that is our final goal. So what does the problem become in, for instance, two dimensions, okay? Suppose that we now have two variables, x1 and x2, and two functions: a function F1 of our variables x1 and x2, and a function F2 of the same variables, and we want to set them both to zero. So we have a system of equations. Graphically, what does that mean, okay? Let's say that the F1 function is in green and the F2 function is in yellow, okay? The level lines of the yellow function are lines like this, and the level lines of the green function are something like this. And one of these level lines — this one — is F1 = 0, and, for instance, this one is F2 = 0, okay? So think of these two functions F1 and F2 as very smooth functions, each with a definite direction in which it decreases, and what we aim at is to identify the point here which is the solution of these two equations. Of course, we have to assume that these two functions are, in a sense, independent — that their zero level lines do not overlap, et cetera. So let's assume all these nice things hold. Then the question is: how do I find the green point? Again, if you have no noise, you just do the same thing: you go down the level lines of both functions, okay? But now we are interested in a slightly more difficult version of the problem, okay? Because when we go from one dimension to many dimensions, we could formulate the problem in different ways, and only one of these is relevant for the problem of learning the value function. So, explicitly, let's think again about the oracle. Suppose now you choose the point (x1, x2) — any point here — and you ask me: give me the values of the functions F1 and F2. And I tell you: nope, those are not gonna be the rules of the game. The rules of the game are: you can ask me for one of these two functions at a time. I can only give you one of them. It's slightly more difficult, okay? But the bottom line is that you can make this work in this case as well. The way you do it, specifically: if you give me function one, I take a step down along function one; and if you give me function two, fine, I'll do a step like that, okay? So there is clearly a way out in this situation too. What is less intuitive is that this solution of stepping along the coordinate directions is not the only one: even if I give you only function one at a given time, you could decide to take a step in this other direction, which is slightly more subtle, okay? But it is something which turns out to be important in temporal difference learning, okay? For the moment, just take it as a side remark. I hope that by tomorrow the story will be complete.
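Here is a minimal sketch of that one-function-at-a-time game — the two linear functions, the noise, and the schedule are all made-up illustrative choices, with the joint zero placed at (1, 2):

```python
import random

# Sketch of the multidimensional game: at each step the oracle reveals a noisy value
# of only ONE of the two functions, and we update only that coordinate.
# The linear functions below are illustrative, with joint zero at (1.0, 2.0).

def F(i, x1, x2):
    # Noisy oracle for F_i; each F_i decreases in its own coordinate.
    true = {1: -2.0 * (x1 - 1.0), 2: -2.0 * (x2 - 2.0)}[i]
    return true + random.gauss(0.0, 1.0)

x = [0.0, 0.0]  # initial guess (x1, x2)
for k in range(200_000):
    alpha_k = 1.0 / (k + 1)
    i = random.choice([1, 2])               # the environment picks which function you see
    x[i - 1] += alpha_k * F(i, x[0], x[1])  # step along that coordinate only

print(x)  # approaches the joint zero (1.0, 2.0)
```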
So the bottom line of this is that even if you add the stochasticity, that is, if you have noisy estimates of your functions f₁ and f₂, the ones that I give you, you will be able to reach this target with probability one as well. And this is what the math tells you if you formulate the multidimensional problem properly, okay? We don't go through that, but you should now have the means to understand what is contained in such proofs, if you want to go through them, okay? Because what I care about the most now is that we go back to learning the value function, okay?

So let's just schematically recap what the analogies are, okay? This is going to be a table of analogies between the simple situation we've been discussing and the language of temporal difference learning, that is, stochastic approximation for the value function. So what is the zero that we are looking for in this multidimensional version, this point here? What is it? Well, the correspondence is that it is the value function of my given policy, which is a vector. So the zero I'm looking for is the value function. And what is the equation that it obeys? Well, it is just the recursion equation, which written in full is

V(s) = Σₐ π(a|s) Σ_{s'} P(s'|s,a) [ r(s,a,s') + γ V(s') ].

So this is the recursion equation, and clearly it's a set of linear equations which have to be satisfied by my value function. But of course, what does it mean to know my f here? It means to know my P and my r, which is exactly the problem that we want to avoid facing head on. So we have to move on. And what is the phi, then, my random function whose expectation is this recursion relationship? Remember, φ(x) was the object defined in such a way that the expectation of φ(x) was equal to my function f(x). So what is it here? Well, it is the temporal difference error that I introduced last time, which for a generic vector V reads

δₜ = R_{t+1} + γ V(S_{t+1}) − V(Sₜ).

And my guess x at any time t will be some vector estimate of my value function. So you see, this is the table of correspondences: any point x, which is my guess, is an estimate of the value function, and the true value function that I'm looking for is the solution of the recursion equation, which is the value of the policy that I'm currently investigating, okay? So this is the correspondence between the things that we've been playing with so far and the things we are after.

So when we summarize all these things and put everything together, we get our temporal difference learning algorithm, okay? Which, since we are in many dimensions now, I'm going to formulate in this flavor: we take a step along one of these functions at a time. Why do we get just one function at a time in temporal difference learning? Because at every step we are in one and only one state. Every time that we move along our process, we visit one state at a time; we do not visit many states at once. So we have just a sample of this vector along the component which corresponds to the current state. Is that clear? So let's make a quick example, we have time for this. Suppose, for instance, you have a two-state model, a two-state MDP, okay? State one and state two. And the policy is fixed, say a random policy, it doesn't matter. How do we learn the value of this policy? Well, we start and we pick one state, s = 1.
We apply an action, and then we might end up again in state one or in state two. At each time step you can measure this temporal difference error, but you measure it only from the state you started in to the state you arrived at. So of this vector here, you're sampling just one component at a time: the one which belongs to the state you are visiting, right? So picture the space of value functions: component one, component two. Here, somewhere, is your zero, which is the value of the policy, the true value. This is similar to the picture from before, the same thing, only for value functions. And I start somewhere here: this is my first guess. Then I observe the temporal difference error, which is a scalar quantity, a number, from state one, maybe back to the same state one. And now, what step do I take? Well, in the simplest implementation, since I was in state one, I take a step along the one-axis. Now, at the second step, am I still in state one? If yes, I keep moving in this direction according to the error. If I jump to state two, okay, I make the jump here, a step along the two-axis. And by this moving along the axes, one can show mathematically, which I will not do, that this nonetheless converges to the point.

Okay, so I'm ready to formulate the final point of our procedure, which is to define an object called the temporal difference zero algorithm, TD(0); later I will explain why there's a zero here. It works as follows. Choose an initial estimate V₀. At the beginning you start with an estimate of the value function; it could be all zeros, okay? If you don't expect anything to happen with the policy you chose, you set everything to zero, or you choose random numbers, it doesn't matter. Fortunately, the problem is well behaved, in the sense that it's a linear problem with a contracting operator acting on it. Remember, this is not the Bellman optimality equation, it's a linear problem, and it is contracting because gamma is less than one. So it's a safe problem from the viewpoint of convergence, and it doesn't matter where you start. If you start with a good guess, you'll get close to the solution faster; if you start with a bad guess, it will take longer. So choose this, and then repeat the following step. Your next guess at step t+1, of the full vector, component by component, is your previous guess plus the learning rate, the learning step, which you schedule according to Robbins-Monro, times the temporal difference error. And then, look here, this error is a scalar, so you have to choose in which direction you want to move. The rule of going along the axes according to what you visit tells you that the extra factor should be one only for the component of the state you are just visiting:

V_{t+1}(s) = Vₜ(s) + αₜ δₜ 1{s = Sₜ}.

So of your vector, you change only the entry which belongs to the state you have just visited; then you move to another state and you sample the error again, okay? But let me recall once more, for completeness, the temporal difference error:

δₜ = R_{t+1} + γ Vₜ(S_{t+1}) − Vₜ(Sₜ).

So you have a guess, which is your vector of values. From this, you observe the new reward and you construct the temporal difference error, which, I remind you, is the discrepancy between the reward you observe, plus the discounted value of where you land, and the value you were predicting for the state you left. If there is a discrepancy, then you adjust your new value accordingly. That's it, okay? So this algorithm is the simplest way of performing learning of a value function.
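Since the tutorial will be in Python anyway, here is a minimal sketch of TD(0) on a made-up two-state MDP with the policy already folded in; the transition matrix, rewards, and discount are illustrative assumptions, and the exact solution of the linear recursion is computed only as a reference:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical two-state chain under a fixed policy: P[s, s'] is the
# probability of jumping from s to s', R[s, s'] the reward on that jump.
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

# Reference: solve the linear recursion V = r_pi + gamma * P V directly.
r_pi = (P * R).sum(axis=1)                    # expected one-step reward
V_true = np.linalg.solve(np.eye(2) - gamma * P, r_pi)

# TD(0): bootstrap from an initial guess, update one component at a time.
V = np.zeros(2)                               # initial guess V_0
visits = np.zeros(2)
s = 0
for _ in range(200_000):
    s_next = rng.choice(2, p=P[s])
    visits[s] += 1
    alpha = 1.0 / visits[s]                   # Robbins-Monro schedule
    delta = R[s, s_next] + gamma * V[s_next] - V[s]   # TD error
    V[s] += alpha * delta                     # step along the s-axis only
    s = s_next

print("TD(0) estimate:", V, " exact:", V_true)
```

The learning loop never uses P and R as known matrices, only the sampled jump and the observed reward, which is the whole point: the same code would run against a real environment in place of the simulator.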
And this algorithm is way, way more efficient than Monte Carlo, okay? Because you're leveraging the fact that you accumulate your learning. You start with a guess, which technically is called bootstrapping, okay? And you improve your guess: you start from somewhere, you bootstrap from some choice, and then you use this bootstrapped estimate to keep updating. Monte Carlo does not do bootstrapping; it just samples. Of course, bootstrapping introduces a bias, which depends on your guess. If your guess is good, the bias will be low; if your guess is bad, the bias will be large. But the scheduling of the rates according to Robbins-Monro takes care of this and balances the variance and the bias. Okay, is the general picture clear? Of course, the details require a little bit of care, but it's very straightforward to implement. And tomorrow you will see this in action, okay? We will write Python code by hand that learns the value function in some simple MDPs, okay?

Final remark: why is this TD zero? Well, it's called TD(0) because, in fact, in the more general form you could write the update as follows, as a vector equation, not in components. The new vector, the new approximation for the value function, is the previous one, plus the learning rate times the temporal difference error, this part is always there, times one more object:

V_{t+1} = Vₜ + αₜ δₜ eₜ.

This eₜ is a vector, okay? And this vector is called the eligibility vector. In the case of TD(0), it's just the vector which has all zeros and a one in the entry that you've just visited. But in general, it could be another vector. What does that mean in geometrical terms? It means that it's not necessary to take a step along the axes. Even though you sample along one particular component, there is a way to take steps which move diagonally. You can do that, and this is what this eligibility vector is doing. Of course, you cannot do it at random, okay? You have to do it according to some rule. What is the idea behind this rule? Well, what you're doing with this eligibility vector is assigning credit to states, okay? This is where the credit assignment takes place. So what does it mean in practice? Let's go back to this picture. You had a guess. You check this guess against reality: you measure your reward and you compare it with what you were expecting based on your guess. Now it turns out that, for instance, you get a positive temporal difference error, which means that you are positively surprised: you were expecting less, but you have seen more. Now you have to update your value function, and you ask the question: okay, I have seen something better than expected, but which state do I have to credit for this? In TD(0), you credit just the last state you visited, okay? You say: everything that is happening now, the good and the bad things that I see, depends only on where I just was. And this is clearly limiting, okay? It's not wrong, but it's limiting. You might want other choices. For instance, maybe the responsibility for what is happening now lies not only with the previous step; maybe it goes back two steps, three steps. So you can incorporate this notion of credit assignment, giving credit to states that you visited further in the past for the surprise that you're receiving now during learning, in a better way, as in the sketch below.
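As a preview of one standard way to build such an eligibility vector, here is a sketch using the accumulating trace e ← γλe (plus one on the visited entry), the textbook rule from Sutton and Barto, run on the same made-up two-state chain as above; the lambda appearing here is the memory parameter discussed next, and lam and alpha are arbitrary picks:

```python
import numpy as np

rng = np.random.default_rng(3)

# Same hypothetical two-state chain as in the TD(0) sketch above.
P = np.array([[0.7, 0.3], [0.4, 0.6]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma, lam, alpha = 0.9, 0.8, 0.01

V = np.zeros(2)            # initial guess for the value vector
e = np.zeros(2)            # eligibility vector: how much credit each state holds
s = 0
for _ in range(200_000):
    s_next = rng.choice(2, p=P[s])
    delta = R[s, s_next] + gamma * V[s_next] - V[s]   # scalar TD error
    e = gamma * lam * e    # fade the credit of states visited further back...
    e[s] += 1.0            # ...and refresh the credit of the current state
    V += alpha * delta * e # a step that is no longer axis-aligned
    s = s_next

print(V)
```

Setting lam = 0 makes e exactly the one-hot vector of the current state, which is why the plain algorithm is called TD(0).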
This better way of assigning credit leads to a class of algorithms called the TD(λ) algorithms, where this lambda is a measure of the memory of the states that you've visited in the past, okay? TD(0) means that basically you neither blame nor credit anything except the very last step of your walk around the states; with lambda greater than zero, you reach further back. To be precise, there are both backward and forward interpretations of TD(λ), and I am mixing the two here; what I mean to say is that you go back into the past to give credit, okay? And what is interesting is that when lambda approaches one, this becomes Monte Carlo, okay? So in a sense, this class of algorithms spans the full spectrum between strong bootstrapping in TD(0) and no bootstrapping at all in TD(1), okay? We will have no time to go through this with the proofs. The good news is that everything is explained very clearly in Sutton and Barto, so with the background you now have, it should be an easy read, okay? The second piece of good news is that tomorrow Emorella will give you examples, and seeing how it works in practice, you can get a sense of what it means, okay? The third piece of good news is that all these concepts, the temporal difference error, eligibility traces, have neural correlates in our brain, okay? There exist things in our brain, like dopamine signaling and many other things, neurotransmitters, the way cells fire, which are, I wouldn't dare to say exact, but extremely close, biological implementations of the concepts and algorithms that we discussed here, okay? The fourth piece of news, and sorry, dear students, this one is less good: you have to do a lot of work to get at all the math behind these things, okay? We cannot do that in these classes, but if you want to delve deeper I can point you to references, and Sutton and Barto is a very good book also for all the references to neuroscience, to psychology, to operant conditioning. So I invite you to take a deep dive into the book for all the things we are not able to cover. Okay, so the plan is: tomorrow, a tutorial with these things happening for real, and from next week we go back to our original goal, which is to merge learning with control, okay? This part was learning to predict; next we use prediction for control. And, okay, I think I'm done, but I'm happy to take questions if you have any. If not, have a nice day and see you tomorrow. Thank you, goodbye!