Okay, good, I should be recording now. I'm assuming everybody is here; if you know somebody is still coming, I can wait a few more minutes. Okay, then I will start.

The topic of today's lecture is state value evaluation. We will not try to find the optimal policy: we are given a policy, and we want to evaluate its value. This is the model-free case. Up until now we always knew everything about the model; then we knew the model but we didn't know the state; now we will know absolutely nothing about the model.

The lecture is divided in two parts, and as usual the timings are approximate. The first half is about the basic methods: essentially Monte Carlo, where we use the basic definition of the value, define the Monte Carlo method and see how it works. Then we will switch to learning rates and see what we mean by that, then we will do temporal difference, and we will try to mix Monte Carlo and temporal difference. After the break we will do something which is a bit more convoluted mathematically but actually very simple, as we will see by the end: temporal difference lambda, and more generally eligibility traces, which are a quite common and useful tool. We will show that they help us retain a sort of short-term memory of past experience when we do the update.

As you have seen, I decided to share a notebook, which is an incomplete notebook, so you can follow on your own. I left some parts blank; I can either give you a short amount of time to think about how to fill those blanks and then show a solution, or we can skip that and I will just provide the solution. In any case, after the lecture everything will be available to you.

So let's start. As I said, the topic is how to evaluate the value function. We are dealing with tasks in which we have a policy but no idea of the underlying model of the environment in which we live. We do, however, have access to the environment in the form of experience: we can produce as many trajectories as we want, which are just sequences of consecutive states, actions and rewards. I observe the state, I take an action following my policy, I get a reward and a new state, and I repeat this until termination. These trajectories are the only information we have about the model, but we will see that this is already enough to evaluate the value.

Indeed, the value function of a state under a policy has as its proper definition the expected return when I start in that state and follow the policy from that point onwards. So this is the first definition: the value of state s under policy pi is the expected sum, over all subsequent times, of the discounted rewards, gamma^t times r_t, given that I start at time zero in state s. The expectation under pi means that all actions are taken with the probabilities given by my policy. We will see that this is already enough to build our first class of algorithms, called Monte Carlo evaluation.
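For reference, here is that first definition written out, using the lecture's own convention in which r_t denotes the reward received at time t (the course notes index it as r_{t+1}):

V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \;\Big|\; s_t = s \,\right].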
The second part, where we deal with temporal difference, will instead exploit the recursive property of the value function, which says something different and which you have seen two lectures ago with Professor Cielani: the value is the expected value of the immediate reward plus the discounted value of the successor state, again if you start in the right state and follow the policy. So the first definition only needs the returns: if you know the returns, you can compute it. The second one needs the value in the next state. These give two different ways of dealing with the problem, one for each definition.

We want to try out the methods you have seen in the theory class, and describe them, on an extremely simple model, and this model is a random walk. In the random walk model the environment is just a string of states; here they are called A, B, C, D, E, and we will number them starting from zero: 0, 1, 2, 3, 4. In every state you have two actions, go left and go right, and these actions do exactly what they say: if you take action left you always go left, if you take action right you always go right. There are no rewards at all, except that when you are in the last state and take the right action you get a reward of one. If you end up in either of the two outer squares, the episode terminates; these are the terminal states of the random walk. That is the environment.

We want to do a random walk, which means taking a special policy. A small question for you, just to start you off: if we want to study the random walk, which policy should we take? What is the basic random-walk policy? ... Is it one half? Perfect, yes, one half, one half. It's the policy which says: I can go left or right, each with probability one half, with no bias. So we will evaluate the policy of going half the time left and half the time right, everywhere; the policy is the same in every state. We want to know the value of each state, and we will use algorithms to do that.

Of course this is a very simple problem, so we can know the true value before doing any simulation, because we can compute it exactly. I wanted to ask you this too, but we can skip it; it's very simple. If you remember the Bellman equation, with probability 0.5 for each action the value of a state is half the value of the left neighbour plus half the value of the right neighbour, because the reward is always zero except on the final transition. So in all intermediate states the value is the average of the values to the left and to the right, except in the last state, where it is the average of 1 (going right) and the value of the state to its left. Since every value is the average of its two neighbours, the value is a linear function going from zero on the left to one on the right. So in this case the true values are 1/6, 2/6, 3/6, 4/6 and 5/6. This is the true value.
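As a worked version of that argument (a sketch under the same assumptions: five non-terminal states numbered 0 to 4, gamma = 1, and the terminal states written as fictitious boundary values 0 on the left and 1 on the right, the 1 absorbing the reward for exiting right):

V(s) \;=\; \tfrac12\,V(s-1) + \tfrac12\,V(s+1), \quad s = 0,\dots,4, \qquad V(-1) \equiv 0,\;\; V(5) \equiv 1 .

The only solution of this averaging equation is linear in s,

V(s) \;=\; \frac{s+1}{6}, \qquad s = 0,1,2,3,4,

which is exactly 1/6, 2/6, 3/6, 4/6, 5/6.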
So essentially, all we are going to do today is find these very simple numbers, which we already know, for a very simple system; the interesting part is the way we will find them.

From the point of view of the simulation, I defined this class here, which is the class of the environment. I made it more complicated than it needed to be, because this is the way environments are generally structured in online libraries as well: if you look at the OpenAI-style environment libraries, the structure is roughly the same, so I wanted you to have something to compare with — not exactly the same code, but the same structure.

Generally speaking, there is an initializer in which you define the observation space; in this case it is very simple, we have n states, not counting the terminal states, and the space is literally the array 0, 1, 2, 3, 4 up to the size. The action space we define is just −1 and 1, left and right, and we decide to always start from the same initial state. In these libraries, with real environments, you will also always find a reset function, which just brings everything back to the start, and a step function, which takes the action — in this case just −1 or 1 — computes the new state and the new reward, and returns the new state, the reward and a flag saying whether the episode has terminated or not. Our case is very simple: the current state is incremented by the action, so by −1 or +1, and then I just check whether the done condition is satisfied. Remember the condition: the episode ends if you step off the left end (state −1) or off the right end (past the last state). That is what is written here: if you go off the right end, the reward is 1 and you are done; if you go off the left end, the reward is 0 and you are done; in all other cases the reward is 0 and you are not done, so the flag is false. Just to point out: in these online libraries you will generally also find a render function which, when you call it, pops up some picture of the state of the system. I didn't include it here because the system is so simple, but this is the way a generic environment is built in most of the libraries you can find online.

Let's try it out. I create this random walk environment and I run five episodes. In each one I check whether the episode is finished, I take the current state, I draw an action randomly from the policy, I give the action to the step function, and I get back the new state, the new reward and whether the episode is done or not, and I print everything — this is just to check that everything works. And this is a typical episode, an entire episode: you see it is not done until it is done. I have the state, the action I took, which was −1; indeed my new state is 1 because I was in 2, took −1, and got no reward; and so on. At the end you see it is done because the final state is −1, terminating on the left, which means done with zero reward. This is all the information our different algorithms will have access to.
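A minimal sketch of an environment with this structure, plus the little test loop, is below. The class and attribute names are mine, not necessarily those used in the notebook:

```python
import numpy as np

class RandomWalk:
    """Gym-style random-walk environment (illustrative names)."""

    def __init__(self, size=5, start=2):
        self.size = size                          # number of non-terminal states
        self.observation_space = np.arange(size)  # states 0 .. size-1
        self.action_space = (-1, 1)               # left, right
        self.start = start
        self.state = start

    def reset(self):
        """Bring the walker back to the starting state."""
        self.state = self.start
        return self.state

    def step(self, action):
        """Apply action (-1 or +1); return (new_state, reward, done)."""
        self.state += action
        if self.state == self.size:       # stepped off the right end
            return self.state, 1.0, True
        if self.state == -1:              # stepped off the left end
            return self.state, 0.0, True
        return self.state, 0.0, False

# quick check: run one episode under the 1/2-1/2 random policy
env = RandomWalk()
state, done = env.reset(), False
while not done:
    action = np.random.choice(env.action_space)
    new_state, reward, done = env.step(action)
    print(state, action, new_state, reward, done)
    state = new_state
```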
So instead of having, as we did with dynamic programming, the transition probabilities and the reward function, this list of numbers is all you have to evaluate the value function. We have to build algorithms which take this list of numbers and produce the estimate of the value function.

The first algorithm — and as always, if you have questions, please ask at any point — is estimation by Monte Carlo, which does basically what is written in this section: we want to collect the values of the return and average them out, which is exactly what the first definition says. The idea is that the agent interacts with the environment to produce many, many trajectories. From a trajectory I can calculate the return at each time: for a trajectory running from time zero to termination, I sum all the discounted rewards from a given time onwards, and that is my return for that time. Then I link the return at that time with the state at that time, so for each state I collect all the returns that started exactly in that state.

This is called the first-visit Monte Carlo method: since I want all the terms in the average to be independent, within one trajectory I only take the first time I visit a state; I compute the return from that first visit and keep only that one. So for each trajectory, the first time I encounter a state, I compute the return from that point onwards and associate it with that state. Then I keep in memory all the returns linked to each state, and the value is nothing more than their average: the value of a state is the average of all the returns I have experienced starting from that state.

The pseudocode is practically the same as the code. I need a policy to be evaluated, which is my half-left, half-right policy. I initialize the values arbitrarily — I will just initialize them to zero — and I start with an empty list of returns for each state, meaning I have not visited anything yet. Then for each episode I sweep from the beginning to the end, I compute the return at each time step, and if it is the first time in this episode I have seen that state, I append the return I just computed to the list of that state. One remark: this here is just a very helpful little function which computes discounted cumulative sums; I found it already written and it does exactly that job.

Sorry, can you re-explain the last step of the algorithm? Sure. We want to evaluate this quantity here, the expected value — so, in a sense, the average — of the sum of discounted rewards from one moment onwards, starting from a given state. What you do in this pseudocode is compute the return, the sum of all future discounted rewards from a given point onwards, for each point of your trajectory: the first reward, plus gamma times the second reward, plus gamma squared times the third, and so on. Then, if you are in a state s_t which did not appear earlier in the episode, you take that return value.
You take that return and you store it in a list connected to that state. For example: in the first trajectory I encounter state zero and the return from there is three; in another trajectory I encounter state zero again and the return is five, so I keep in memory three, five; in another trajectory I never visit state zero; in yet another I find five again. At the end I will have, associated with state zero, a list of the returns found in the different trajectories — three, five, five, then six, one, and so on — and I know that the value is just the average of those numbers. Is it clearer? Yes, thank you. Perfect, and thank you for the question.

So, exactly as I said, that is the pseudocode; now let's go to the code. Again we define a class, which is our algorithm. It just needs to know the system size and the gamma we want to use for the discount. I create a list of empty arrays, one per state, so if the size is five I have five empty arrays. The update happens at the end of a trajectory: from the user I require the trajectory of states and the trajectory of rewards — that is my end-of-episode update. At the beginning of the episode, of course, no state has been visited yet, because nothing has happened. Then I compute the returns: this is the helper function which, for every time step, gives the discounted cumulative sum of everything that comes afterwards, so now I have an array with all the returns, G0, G1, G2, G3, G4, and so on. Then we just have to do the last part of the algorithm: for every time step t, s_t is the visited state, and we want to append that return to the list of returns of that state, if we have not yet visited that state in this episode.

I will leave you about forty seconds — you don't need to write it, because I will give it to you — but I want you to think about what the code here should be. If you have no idea, ask me; thinking about it may help you keep a mnemonic picture of what we are doing. ... Have you thought about what goes there? Are you completely lost? If you are completely lost, that's fine, in the sense that you can ask questions; otherwise I'll show you the solution and move on.

As I said, what you need to do is: if you have not seen the state yet in this episode, store in memory the return connected to that time step. That is exactly what goes here. If the state s is not in the visited list, then for that state s you append to its list the return from that step — which is exactly the discounted sum of rewards from there onwards — and you record that the visited list now also contains this new state. And if the length of the visited list equals the size, you do not need to go further, because you know you have already seen every state. This is first-visit Monte Carlo: every time you see a new state in an episode, you store its return (a minimal sketch of this update is shown below).
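Here is a sketch of that first-visit update, including a discounted-sum helper of the kind mentioned above; names and details are mine and may differ from the notebook:

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma*r_{t+1} + ... computed backwards in one sweep."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

class FirstVisitMC:
    def __init__(self, size, gamma=1.0):
        self.size = size
        self.gamma = gamma
        self.returns = [[] for _ in range(size)]   # one list of returns per state

    def episode_update(self, states, rewards):
        """Store the return of the first visit to each state in this episode."""
        G = discounted_returns(rewards, self.gamma)
        visited = set()
        for t, s in enumerate(states):
            if s not in visited:
                self.returns[s].append(G[t])
                visited.add(s)
            if len(visited) == self.size:          # every state already seen
                break

    def values(self):
        """Value estimate = average of stored returns (0 if never visited)."""
        return np.array([np.mean(r) if r else 0.0 for r in self.returns])
```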
I have stored all the returns I saw; I take their average, and that is my mean, my value estimate.

Now, there are two problems with this algorithm. The first is that it needs to reach the end of an episode before it can make a guess, so you cannot update whenever you need to, which can be very costly. The second is that it needs to keep all the returns in memory. Here that is fine, but if you have a system where you have to store millions of numbers, it gets really uncomfortable.

For this reason we introduce the learning rate. Instead of storing the returns and then averaging them, we do the following, which is like a stochastic way of doing the same thing. I do basically the same as before, but instead of storing the return, I do an update. The update means that whenever I visit a state, I say: my last approximation of the value was V, and now I add a small quantity, alpha times (the return I just found minus the value). Since the return is what should average out to the value in the end, this G minus V pushes in the direction which should eventually bring me to the right answer. So I do an update on the fly, with nothing to store: whenever I get a return, the new value is the old value plus a small fraction of the difference between the return I got and the value. You can see that if alpha were one, every time the value would simply be assigned to the return I just saw — "forget my past, this new return is my new value." If alpha is smaller than one, it mixes together all the returns found so far.

As before, there is a small piece of code, again very simple because it is the same structure as before. Now I don't even care about the visited bookkeeping, because that was about keeping trials independent; here I am mixing everything up anyway. So all I have to do is compute the returns, which are calculated here, and then for all the steps, for all the visited states, update the value by this small quantity: alpha times (the return minus the value). You can think about it for a second; it is even simpler than before. For every time step, the only thing you need to do is add, multiplied by alpha — which, I apologize, is properly called the learning rate, a very standard name — the return G found at that step in that state, minus the previous value for that state. So the value for that particular visited state gets nudged toward the return for that time step, away from the value I previously thought was right. (A minimal sketch of this incremental version is shown below.)
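A sketch of this constant-learning-rate Monte Carlo update — again the naming is mine, and it reuses the discounted_returns helper and the numpy import from the sketch above:

```python
class ConstantAlphaMC:
    def __init__(self, size, gamma=1.0, alpha=0.1):
        self.gamma = gamma
        self.alpha = alpha
        self.V = np.zeros(size)                    # current value estimates

    def episode_update(self, states, rewards):
        """Move each visited state's value a little toward the observed return."""
        G = discounted_returns(rewards, self.gamma)
        for t, s in enumerate(states):
            self.V[s] += self.alpha * (G[t] - self.V[s])
```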
Okay, good. Now let's see how these work, and this is something I will do for every algorithm. It is very simple: I will run hundreds of episodes, using the random policy, which is the policy of the random walk. I am putting gamma equal to one, just for simplicity, so there is no discounting, and we said we know the true values: 1/6, 2/6, 3/6, 4/6, 5/6. I am taking a somewhat arbitrary alpha, 0.5. I create an environment, I create both algorithms, and I want to evaluate how far the values measured by the algorithms are from the true ones after each episode of learning, averaged over many independent runs.

So for each algorithm I have a number of episodes, and I have to store the trajectory of each episode. I reset the environment, I run the episode as before — you see the action is the random choice between left and right — I get a new state and a new reward, and I keep in memory the trajectory of states and the trajectory of rewards. Then I hand the two trajectories to the two algorithms, first-visit Monte Carlo and Monte Carlo with a learning rate, and I ask each one: give me your estimate of the values. Since I know what the true values are, I can then compute the error between what I got after a certain number of learning episodes and the truth. As always, this takes a little while, because I asked for a large number of trajectories.

Okay, so this is a typical efficiency curve: the error as a function of the episode number. You can see that first-visit Monte Carlo is rather robust: the error goes down toward zero quite nicely, and after 100 episodes it is already quite small. Monte Carlo with a constant learning rate actually learns quite well at first, and then fluctuates around. This is the reason why, as you know, the learning rate should decrease over time: there are conditions on how the learning rate must decay to be sure the algorithm converges with probability one, and we are not respecting them. We are using a constant learning rate, which is known to have exactly this problem: it goes fast toward the solution and then keeps fluctuating. You can clearly see that with a much larger learning rate the performance is even worse, because I am trying to change my value much faster: at the beginning I learn very fast, because I am far from the right solution, but then it is a mess. And with a very, very small learning rate, in the end I get a much better solution.

So we have done the first, very simple thing, which was Monte Carlo: we had a way of defining the value as an average, and we exploited that average. We have not used the Markov property; we have not used the recursive property of the value function; we have not used anything else. Now we will do exactly that. We will use the property that the value can be written as the instantaneous reward plus the discounted value of the next state. As you have seen, you can define what is called the temporal-difference error, the TD error delta, which is the reward plus gamma times the value in the next state, minus the value in this state. You can see it is taken directly from the equation above: above, the value equals the expectation of reward plus gamma times the value in the next state; here, delta is defined as exactly reward plus gamma times the next value minus the current value. The reason is very simple: this delta should be zero on average, because the value is precisely the expected value of reward plus gamma times the next value, so if you take reward plus gamma V minus V, its expectation is zero. And the meaning of delta is actually very simple to understand.
Let me put it this way. The reward r_t is the real reward I experienced now, the reward I just got. And V(s_t) minus gamma times V(s_{t+1}) is the reward I should have expected to get now: since one is the value at time t and the other is gamma times the value at time t+1, their difference is exactly the expected immediate reward. So delta is the difference between what I truly experienced as the reward and the reward I expected at time t. That is what delta means, beyond the mathematical definition.

Now, the pseudocode. What is truly nice about this is that it does not require a whole episode, because the definition of delta requires only the state at time t, one action, the new reward and the new state. So after one single step you can already evaluate delta and do the update. Monte Carlo requires the whole trajectory of a whole episode to make one update; temporal difference requires only one step. And this is indeed the pseudocode: you have a policy to be evaluated, you have a small constant step size alpha, you start with values chosen arbitrarily, except that the value of the terminal states is zero. Then, in whatever episode you are running, whenever you take an action from a state s, you can compute delta: the reward you actually received, plus gamma times the value you currently believe the new state has, minus the value of this state. And you update the value of s by a small quantity, alpha times this difference between what you experienced and what you expected. You can do this update at every single step, over and over.

I am running a bit late, but again, I created an algorithm class, which is very simple, and now the update is a single-step update, not an end-of-episode update: you do not need the whole trajectory, just the state, the reward and the new state. You can again spend a few seconds thinking about what the update line should be; it is just this one line. The only thing that needs care is that if the episode is done, I do not have a value for the new state, because the value of all terminal states is zero; that is the only subtlety. So at each step I take the value of the new state, or zero if the episode has finished; I compute reward plus gamma times the value of the new state minus the value of the state I am in; and I apply the update. Explicitly: if I am done, so the episode is finished, delta equals the reward plus zero (because I have landed in a terminal state) minus the value of my current state; if I am not done, delta equals the reward plus gamma times the value of the new state minus the value of my current state. Then I do a small update: I add to the value this small coefficient alpha — the learning rate — multiplied by my delta.
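A minimal sketch of that TD(0) step update, with the same illustrative naming as the earlier sketches:

```python
class TDZero:
    def __init__(self, size, gamma=1.0, alpha=0.1):
        self.gamma = gamma
        self.alpha = alpha
        self.V = np.zeros(size)

    def step_update(self, state, reward, new_state, done):
        """One-step TD update: V(s) <- V(s) + alpha * delta."""
        next_value = 0.0 if done else self.gamma * self.V[new_state]  # terminal value is 0
        delta = reward + next_value - self.V[state]
        self.V[state] += self.alpha * delta
```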
So this entire temporal-difference algorithm is essentially four lines, and it is very simple to write. The whole idea of today is that all of these things are very simple to implement in their most basic form. Let me copy this, because I will need it later; I apologize. Here I do the same as before: 50 runs for averaging and 100 episodes, and I compare the results with TD(0), which updates at every step. You see that now, instead of waiting for the end of the episode, at each step I have the new state, the old state and the new reward, and I ask the TD(0) method to update the value. Before, the update happened after the whole episode; now it happens between single steps. And you can see what happens: it does something comparable to the Monte Carlo methods. I am running a bit late, so let's take ten minutes of break; if you have questions, please ask them now. The next part will deal with ways of mixing Monte Carlo and temporal difference.

Okay, we are back. If any of you have questions, please ask; otherwise I want to do a quick recap and stress three small things. The first: what we have done in all these algorithms so far is take something which we know should be an estimator of the value. For Monte Carlo, what we compare the value against is the return: the value is the expected complete return, the discounted sum to termination, and the algorithm takes these returns and averages them out in order to estimate the value. For temporal difference, the corresponding "return" is what is sometimes called G_{t:t+1}, the one-step return, which is the reward plus gamma times the value of the next state. So again we are trying to make our estimate of the value as close as possible to what we observe, r plus gamma V. In a sense, Monte Carlo takes the complete return and averages complete returns to get the value; temporal difference takes the instantaneous reward plus gamma V — the one-step return — and tries to assign that to the value. We will see other things later, but this is the basic idea, and also in the future we will take some kind of estimate of the return and try to assign it to the value.

The second thing I want to stress, which perhaps I did not stress enough before, is what it means for these methods to be model-free: they require only trajectories. We have used this very simple random-walk environment and this very simple random policy, but the small scripts we shared with you work for any environment and any policy, as long as the state space is tabular. Here I have a random-walk environment and a half-half policy; but if the environment were a super-secret NASA spacecraft costing millions of euros, and the policy were some strange thing defined over a million states, I would not care: as long as I have a list of states and a list of rewards, these very simple, very basic algorithms can produce a state-value estimate.
That is because the model does not appear anywhere inside these learning algorithms, except through the size: the learning algorithm must know how many states there are, and it needs trajectories. What actually produces the trajectories is completely irrelevant. We are using the simplest random walk and random policy just to have a benchmark, but these few lines are extremely general. That is what I wanted to stress.

So let's move on. I will go very quickly over what I have just said. Monte Carlo waits for the end of an episode so that it can evaluate the complete return — remember, the complete return is the reward, plus the discounted next reward, plus the further discounted one after that, and so on — and its point is to update the value using the complete return. One-step temporal difference, TD(0), can instead update at every step, and what it uses is the one-step return, defined as the reward plus gamma times the value, so it updates the value using only that one-step return.

We can mix the two, and there is a whole class of methods called n-step temporal difference, which use the n-step return; the n-step return is something in between Monte Carlo and TD(0). The n-step return, say G_{t:t+n}, takes the experienced rewards for the first n steps — the instantaneous reward, plus gamma times the next reward, plus gamma squared times the one after that, and so on, all the discounted rewards up to step n — but for the last step, instead of using experience, it uses the information it already has, the value estimate. So it is clearly in between the two. This gives a whole class of algorithms which update the value using the n-step return: as before, my new value is my old value plus the learning rate times (the return I just experienced minus the value I had). The whole family of these updates essentially says: my approximation of the value target can be the complete return, the one-step return, or the n-step return, I don't care; the update is always the old value plus a small fraction of the difference between what I just experienced — whichever return it is — and my previous estimate of the value. I will not even go into the code, which you will find in the notebook; it was just to show that there is a very simple way to interpolate between Monte Carlo and temporal difference (the targets are written out in the sketch below). Any questions about this?
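Written out, again with the lecture's convention that r_t is the reward received at time t, the one-step, n-step and complete returns, and the common update they all feed, are:

G_{t:t+1} = r_t + \gamma V(s_{t+1}), \qquad
G_{t:t+n} = r_t + \gamma r_{t+1} + \dots + \gamma^{\,n-1} r_{t+n-1} + \gamma^{\,n} V(s_{t+n}), \qquad
G_t = \sum_{k \ge 0} \gamma^{k} r_{t+k},

V(s_t) \;\leftarrow\; V(s_t) + \alpha\,\bigl(G - V(s_t)\bigr),

where G stands for whichever of these targets the method uses.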
Now we go, unfortunately, into a part which is a bit more technical. There is some mathematics, but the message is very simple, and I hope that at least the message will come through. Why is TD(0) called temporal difference zero? Because it is actually one specific case of a whole class called TD(lambda), where lambda is a number between zero and one; we will see that lambda equal to zero brings back TD(0) and lambda equal to one brings back Monte Carlo. TD(lambda) is very elegant because it allows us to bridge the two concepts of Monte Carlo and temporal difference.

What it does is define a return, called G_t^lambda — the lambda-return — which mixes all the returns we have seen so far. The lambda-return takes G_{t:t+1}, the one-step return we defined as the instantaneous reward plus gamma times the value of the next state, with weight (1 − lambda); then it takes the next one, G_{t:t+2}, with an extra decaying factor lambda; G_{t:t+3} with lambda squared; and so on. So it mixes all the returns we have defined so far into one single return.

We can also rewrite it, if we wish, in two pieces. After termination every n-step return is just the complete return, so all those terms collapse into a single one. The infinite sum therefore splits into the sum up to termination — terms which mix experienced rewards and value estimates, so a mix of experience and expectation — plus a last term, which is just the complete return. Splitting it like this makes the limits obvious: if you put lambda equal to zero, the second piece dies, all the terms carrying lambda to a power of one or more die, and only one term remains, G_{t:t+1}; so for lambda = 0 the return is the one-step return. If lambda equals one, the whole first piece dies because (1 − lambda) goes to zero, and what remains is just G_t, the complete return. So lambda is a parameter that lets us move continuously from zero, one-step temporal difference, to one, full Monte Carlo.

Why would we want something like that? Because, as you have seen in the previous classes, temporal difference is very good because it bootstraps: it does not use only the experience of the current moment, it also uses the information you have accumulated before, since a good estimate of the value can be reused inside the one-step return. On the other hand, that previous information can be wrong: if your initial value estimate is far from the correct one, every one-step TD update drags that error into the target. Monte Carlo requires no previous knowledge of anything; everything is pure experience, so it has no bias — but it has a very large variance. Temporal difference has very low variance but can have a large bias. So having a knob that switches between the two, which is TD(lambda), is a very helpful way of mixing them, and it is known that TD(lambda) with an intermediate value of lambda typically works better than either extreme, full Monte Carlo or full temporal difference.
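The lambda-return written out, consistent with the definitions above (T denotes the termination time, a symbol not used explicitly in the lecture):

G_t^{\lambda} \;=\; (1-\lambda)\sum_{n=1}^{\infty} \lambda^{\,n-1}\, G_{t:t+n}
\;=\; (1-\lambda)\sum_{n=1}^{T-t-1} \lambda^{\,n-1}\, G_{t:t+n} \;+\; \lambda^{\,T-t-1}\, G_t ,

so that lambda = 0 leaves only G_{t:t+1} and lambda = 1 leaves only the complete return G_t.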
The problem is that, as defined, you again need all the future steps. With TD(0) you need only one step and you can do an update; but here you need the one-step return, the two-step return, the three-step return, the four-step return, and so on. This is what we call the forward view: the lambda-return at time t is built from all the returns in the future, which is not useful, because, as with Monte Carlo, you need to wait for the whole episode and then go back, and that is a waste of time. There is a way to deal with this, based on retaining a memory of your past visits, and it is called eligibility traces.

I want to split the next part in two. One part is just hand-waving: what are eligibility traces? Eligibility traces are a way of storing information about where you have been, and of assigning updates not only to the state you have just visited, but to all states, with weights that change according to how long ago you visited each state. So eligibility traces, in a sense, are an empirical tool. Say I have experienced state zero and then state one: I do not want to change only state one, I also want to change state zero, because I was there just before; if I now find that state one is very good, then state zero should also look good, since it led me here. Seen this way, eligibility traces are a tool for keeping a memory of the past and updating all states according to that memory — and this message stands on its own, independently of what I am about to show. What I want to show now is that, in their basic form, eligibility traces are a way of computing lambda-returns — which involve future returns — without having to wait for the end of the episode.

Keep these two things in mind. First: we have a way of combining Monte Carlo and temporal difference, called TD(lambda), which uses lambda-returns, and these in principle have to be computed by looking into the future. Second: there is a tool, called eligibility traces, that lets us do it at each step instead of at the end of an episode; and this same tool is also used, in general, as a memory of where you have just been, which helps efficiency a lot. I hope I have not completely lost you; let me do the mathematics, and then hopefully you will see what I mean.

We said that TD(lambda) aims to do this update: change my value by a small alpha so that it moves closer to the lambda-return G_t^lambda. Now let's expand G_t^lambda by its definition. It is (1 − lambda) times the one-step return — the instantaneous reward plus gamma V — plus (1 − lambda) lambda times the two-step return, which is the real reward, plus gamma times the next real reward, plus gamma squared times the value; then the three-step return, with weight (1 − lambda) lambda squared, which has three actual rewards and gamma cubed times a value; and so on, again and again. Every term carries some power of lambda, a number of actual rewards, and gamma to some power times a value. That is the expansion of the return. Now you can see that something happens in these columns here.
In each column, something appears over and over. For example, the first reward appears in every single term, so you can sum up all those contributions; they carry a weight (1 − lambda) times (lambda to the zero, plus lambda, plus lambda squared, and so on), and since the infinite sum of the powers of lambda equals 1/(1 − lambda), the factors cancel out and the first reward simply keeps coefficient one. The terms with the second reward are exactly the same, but each carries one additional gamma and one additional lambda, so they sum up to gamma lambda times the second reward. Likewise, those with the third reward sum up to (gamma lambda) squared times it, those with the fourth to (gamma lambda) cubed, and so on. And the value at each horizon appears with just two terms, one like gamma V and one like minus lambda gamma V, times the appropriate power of gamma lambda.

So all of this apparently strange expansion can be rewritten as a single sum: the reward terms collapse as I just said, and the value terms combine with them. And now you can see that each bracket in that sum is something very, very similar to my delta: it is reward plus gamma V minus V for one single transition, just shifted to times t+1, t+2, and so on. So I can use this as follows: to do the update at time t, it is as if I keep on updating that same state. At time t I update the value of s_t with weight one. At the next step, even though I am now in another state, I update the value of that same state s_t again, but with a decaying factor. It is as if, at time t, I am in the state and I have a memory of being there; then I move away, but I keep a memory of having been there which decays with time. At time t+1 I am somewhere else, but I am still nudging the value of that earlier state, with a decaying factor; at time t+2 I am in yet another place, but I remember I was there two steps ago, so I still adjust that value, with an even smaller weight; then I move again, and so on.

There are some subtleties all around, but the point is this: eligibility traces, as a general rule, keep track of where you have been in the past, and update the values of where you have been using the current estimator delta. You are somewhere, you compute delta — I was here, I did this action, I got this reward, and this is my new value estimate. That delta normally enters only the update of the state you are currently in, say s_{t+2}; instead, we use that same delta also for s_t, but with a very small factor.
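The regrouping just described can be summarized (a sketch consistent with the notation above, valid when the values V are held fixed during the episode) as:

G_t^{\lambda} - V(s_t) \;=\; \sum_{k=0}^{\infty} (\gamma\lambda)^{k}\, \delta_{t+k},
\qquad
\delta_{t+k} \;=\; r_{t+k} + \gamma V(s_{t+k+1}) - V(s_{t+k}),

so the forward-looking lambda-return update on V(s_t) is the same as adding each future TD error to V(s_t) with a weight (gamma lambda)^k that decays with how long ago state s_t was visited.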
So, in their most basic form, eligibility traces are just weights, one for every state, which start at zero. If I visit a state, its trace gets plus one; and when I move away from it, the trace just gets multiplied by lambda gamma. I have weights everywhere which are zero; I visit a place, its weight jumps up by one and then decays; if I visit it again, it gets another plus one and decays again. And at each step, instead of updating the value only of the state I am currently in, I update the values everywhere, with weights proportional to the eligibility traces.

What I want to say is that everything we did so far — the instantaneous reward, the standard return — was used to update the value of a single state. If at time t = 3 I was in state 3, all my different estimates of the return were used to update the value of state 3. With eligibility traces, at time t = 3 I am in state 3, but the values everywhere get updated — some more and some less, depending on how long ago I was there. It is an extremely helpful tool because it connects past experience to present experience in a way no other method does. This is why I said there is a connection between eligibility traces seen as memory and eligibility traces seen as a way to implement TD(lambda). I hope this is a bit clearer.

The nice thing — because up to now it may have looked like a mess from the mathematical point of view — is that eligibility traces are extremely simple to implement, at least in this basic form. This is the pseudocode, essentially the same as before. We initialize the values arbitrarily; we initialize the eligibility traces to zero, because we have not visited anything yet. We start an episode; I take an action, I observe the reward and the new state, and I compute the usual TD(0) delta, r plus gamma V minus V. Then, since I have just been in that state, that state's eligibility trace gets plus one. Then, for all the states — not just the state I am in — I do an update proportional to the learning rate, to delta and to the eligibility trace of that state; and afterwards all the eligibility traces, everywhere, decay by the factor gamma lambda. Then I am in a new place, I take another action, I collect a new reward and a new state, and I do the same: the trace of the new state gets plus one, and all states everywhere get updated again.

There are subtleties. The backward and forward expressions as I defined them are perfectly equivalent only if at each time I were to use the values as they were defined in the past, which is not true, because I keep changing them step by step: the delta I use at time t+2 is not exactly the delta appearing in the forward expression, because it uses the values as they stand at t+2. So the two views are not perfectly equivalent if I do a step-by-step update, but most of what I said holds, and the general gist of eligibility traces is there. Do you have questions about this? I know it is a bit tricky, but I hope the main message about memory got through. Too many questions and nobody wants to start? Okay.

When I look at the code, the only thing different from before is that I need an eligibility trace vector and an eligibility decay factor, again a number between zero and one: I just store one extra vector, with the same size as the observation space (a minimal sketch of this single-step update is given below).
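A sketch of that backward-view TD(lambda) step update; the names are mine and it mirrors the structure of the TD(0) class above:

```python
class TDLambda:
    def __init__(self, size, gamma=1.0, alpha=0.1, lam=0.8):
        self.gamma = gamma
        self.alpha = alpha
        self.lam = lam
        self.V = np.zeros(size)
        self.E = np.zeros(size)          # eligibility traces, one per state

    def step_update(self, state, reward, new_state, done):
        self.E[state] += 1.0                                    # mark the visited state
        next_value = 0.0 if done else self.gamma * self.V[new_state]
        delta = reward + next_value - self.V[state]             # usual TD(0) error
        self.V += self.alpha * delta * self.E                   # update *all* states
        self.E *= self.gamma * self.lam                         # decay all traces
        if done:
            self.E[:] = 0.0                                     # reset traces at episode end
```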
The single-step update is then the same as before, in the sense that I still have to compute a delta; I just need one very simple extra operation on the eligibility traces: either a trace gets plus one and then decays, or it just decays. You can think about it for a couple of seconds, but essentially, as we said: you update the eligibility trace of the state you just visited, which gets plus one; you define delta exactly as in TD(0); then, instead of updating only one state, you update all the states, weighted by the eligibility traces; and then you decay all of them. So, if you like, there is a third step, apply the update everywhere, and a fourth, decay the traces.

Indeed, let's write it. As always, it is super simple. I update the eligibility trace for the current state: the trace at that state gets plus one. I define delta as before, taking care that at the end of the episode I do not have a value for the terminal state. Then I apply the update, which is proportional to delta, to the learning rate, and to the eligibility trace everywhere — because now this is a vector of values and this is a vector of traces. And the last thing: I apply the decay factor, and if the episode is finished, all the eligibility traces go back to zero.

What are eligibility traces in general? We have seen them come out of TD(lambda), of this return which mixes all the one-step, two-step, three-step returns, but they can also be seen — and are generally used, in many forms — as a memory of where the system has been in the past. As I go through my trajectory, it is as if I light up a memory of where I am, and the update, instead of being localized at the current position of the system, is spread over all the places that have been recently visited. The main difference is that in Monte Carlo and in TD(0), if I am in s_t at time t, only the value of that state is modified — in Monte Carlo by the complete return, in TD(0) by the one-step return, or by the n-step return, it doesn't matter; only that one value is modified. With eligibility traces, the return approximation — again, one-step, complete, whichever — modifies all states, each with some weight. They are called eligibility traces because, in a sense, they mark which states are eligible for change. You will see them again with function approximation, where you are not in one single state but in an approximate representation of it, so all states sharing some feature get changed.

And the last thing: let's run it again. We take a learning rate, we take a lambda, we know the true values, we compute the empirical error, we create a random walk, we average over some runs and some number of episodes, and we look at the average performance of TD(lambda). You can see the TD(lambda) curve; here the learning rate was very small, and with a much larger one it moves faster at the beginning. Essentially I have introduced a new hyperparameter which I can tune, but the point is that this gives a way to bridge between the two extremes. One last picture to show the difference between the methods — remember that TD(1) is just Monte Carlo — and what happens.
Let's take just one episode: what is the change to the values of all states at the end of one episode? I start with all the values equal to zero, I run up to the end of an episode, and I look at the changes that the three different algorithms make to the values for the same experience. So I have the same environment, the random walk, and they all see the same trajectory; I want to see, at the end of the first episode, what the predicted values are, starting from zero. I put 0.1 as the learning rate for all three, gamma equal to one, and a size of 10; and I am using TD(lambda) with lambda = 0.8, TD(0), which is lambda = 0, and Monte Carlo. You see here why TD(lambda) is very nice: all the algorithms I built before can actually be written as a TD(lambda). In a sense I could have started from the end and finished in ten minutes, but I'm afraid it would have been a less nice structure.

So, you see that after one episode nothing has changed. Can you guess why? I cannot hear you, but I know you wanted to say something — please unmute, I'm sure you have the correct answer. ... No? Anyone want to guess why after one episode all the values are still zero? Remember what the environment is: a random walk with two ways of terminating — it either terminates on the right and gets plus one, or terminates on the left and gets zero. By chance, this episode terminated on the left, so we had essentially zero signal: zero reward everywhere, hence no update. That can happen in a single episode. Let's do another episode — again, nothing. Okay, now I have three curves; give me one second so we can label which one is which.

Now we see the three curves, and evidently this episode finished with a plus one, with a positive reward. The three types of update are rather different. One is TD(0): why does it look like this? There is a simple reason: TD(0) only sees one step ahead, and it relies on previously known information; until it gets a reward, that information is always zero. Only one step sees a signal, the last step, where the reward of plus one arrives. So it says: from this state I saw a reward of plus one, now I have information — but only in that particular region. That state goes up by the learning factor, and all the other steps contribute nothing. Monte Carlo, instead, keeps the return for all states; since gamma equals one, each return is just the sum of all rewards up to the end, and the only reward is the final plus one, so all the returns are equal to one. So I am adding one, multiplied by the learning factor, multiplied by the number of times I visited each state. Since this is a random walk, I was mostly around the center before exiting, so the prediction is larger in the middle, simply because I visited the central region more often. And TD(lambda), remember, has a decaying memory.
So for TD(lambda) the largest change is in the states that were seen most recently, and it already shows the shape the value should have — remember the true value is linear (1/6, 2/6, 3/6, 4/6, 5/6 in the five-state case), so the shape should be linear. Temporal difference is weighed down by the fact that its previous estimate was all zeros, so only one state changes; Monte Carlo changes many states, but wildly, because it has a very large variance; TD(lambda) combines the strengths of both and ends up with a good estimate: here is where the reward actually arrived, and it propagates back to all the states visited previously, with a decaying factor. These, basically, are the differences between the three methods.

This concludes the lecture. I hope some parts were clearer than others. I will provide the completed code with the results and a few corrections of typos, so don't worry, you will get it and you can go over it. I also pointed out some references: for eligibility traces and the backward versus forward view, I found this blog post, which is part of some online material, very clear if you want to see more; and this random walk is Exercise 6.2 of the book, so you can find the same material there too.

One thing I apologize for: I did not follow the convention used in the lectures. In that convention the reward's time index is incremented by one; I used the convention where it is not. So, for example, what you see here is consistent with my notation, but in your lecture notes it would be r_{t+1}. It is just a convention, but please be careful and take note of it. I tried to put in as many comments as possible so that these algorithms stay simple — and they are simple by nature, that I can assure you. If you have questions, or if you want to expand on any of this, you can also write to me. Let's stop here.