Okay, go back to sharing. Okay, perfect. So today we will deal with model-free control. Control means that we are not going to evaluate a given policy; we are going to try to find an optimal policy. And model-free because, again, we will not have access to the details of the model, but only to it indirectly, through trajectories. The lecture is divided roughly as follows: in the first part I will build upon the last exercise on value evaluation, and we will see how value evaluation works on the Q value function. Then we will use this to build two methods which are very similar but have important differences between them: first SARSA, and then Q-learning. By then roughly one hour will have gone by, so after the break we will deal with something slightly more advanced than SARSA, which is expected SARSA, and which in a sense connects Q-learning and SARSA. And then we will deal briefly with a topic you have seen in the theory classes, which is always a nice tool to have: a bit of discussion about convergence and the learning rates, and you will see how to change epsilon in time. So let's start for real. Okay, Q value learning. Last time we dealt with many ways to evaluate the state value function, so V of S given a policy. Now we will build on that, but our main focus will be control, so finding the optimal policy. But notice, and I apologize for the inconvenience again, that I now switch back to the correct notation, in the sense of the notation which is most common, where the reward index is the one right after the action. This is just a side note. Okay, for these control algorithms it will be more convenient to use the Q value, so the value of the state-action pair, and not the state value. So we will first see the definition of this value function again. The definition of the value function for the state-action pair is just the expected value. (We lost you, I think we lost you a little bit. I don't know if it was everybody or only me, but I didn't hear you for a little while. Okay. From the definition. Okay, perfect.) Let's go back to the definition. Essentially the definition of the Q value function is basically identical to that of the state value function: it is the expected value of the sum of all the future rewards, discounted, starting from the state S. But now, instead of always following the policy, the first action taken is exactly the one we are evaluating. So the Q value for a state and an action means: I am in that state, I do that action first, regardless of what my policy is, and only afterwards do I follow the policy. So basically it is the same as the value function, but the first action is given. As always, there are many equivalent formulations of this value function. One, for example, is that if you think about trajectories, the Q value is the expected value of the first reward plus, discounted, the value of where I end up. So I am in a state S0, I do an action A0, I will then be in some state S1, and from there on I know that the expected infinite discounted future reward is just the value of that state. So I can rewrite it as the first reward plus the discounted value onwards.
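Written out, a sketch of these first two formulations in the standard notation (matching what was just said in words):

```latex
% Q value of a state-action pair under a policy \pi (standard definition)
Q^{\pi}(s,a) \;=\; \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s,\; A_t = a \right]
% equivalently: the first reward plus the discounted value of the state you land in
\;=\; \mathbb{E}\!\left[\, R_{t+1} + \gamma\, V^{\pi}(S_{t+1}) \,\middle|\, S_t = s,\; A_t = a \right].
```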
Then another formulation, which connects the Q value function only with itself, is this: I am in a state, I do an action, and I get the reward; then, from the new state S1, I consider all the actions available there with the probabilities given by the policy, and I evaluate the Q value for each pair of new state and possible action. That is what is written here: I am summing, over all possible actions in the new state, the probability of taking that action, which is just the policy in that state, times the value of being in that state and doing that action. So as always we have different formulations which are equivalent, but each is more useful in one case or another. The first thing I want to do, to bridge last lecture with this one, is pure value evaluation: I will give a fixed policy, I will not try to optimize it, and I want to evaluate it, but as a Q function. I will use the third formulation, the one which connects the Q value for a state and action with the Q values of the new state and all possible actions there. So this is just the class. You will see it has the same structure as last lecture; I try to keep the same structure for all the main code parts, so it is easier to go back and forth between them. So this is the Q evaluation. Again, I ask for gamma, the discount factor, and for the system size. While the value function had the same size as the state space, now we have a size which is the state space times the action space, so I need both the state size and the action size. I take the policy as input, because that is what I am trying to evaluate, and the learning rate, as always. I define this zero matrix whose size is the state size and the action size together (this is just a fancy notation for it). And then there is the most important part, which is the single-step update. Remember what we are trying to do: we take the old estimate of Q and we want to produce the new one. The function gets the state, the action, the reward, and the new state, so this is S0, A0, R1 and then S1; this is what we gave to the algorithm last time as well. And if you remember, as last time, if the episode was finished, then in temporal difference zero the delta was the reward plus zero minus the value of the state. We have the same thing here, but I call it delta Q to differentiate it: it is the reward plus zero minus the value of the state-action pair. So you see it is basically identical, except that instead of the value of the state alone we have the value of the state-action pair. And if the episode is not done, so if I am in a new state from which I can do actions: in TD(0) the delta was the reward plus the discounted value of the new state minus the value of the old state. Here it is the reward plus the discounted value of the new state, but instead of having a single value for that state, I have to sum, over all possible actions, the value of the new state with each possible action. That is exactly what is written here.
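As a minimal sketch of that single-step update (the class and variable names here are my own assumptions, not necessarily the ones on screen):

```python
import numpy as np

class QEvaluation:
    """TD(0)-style evaluation of Q for a fixed policy (sketch)."""

    def __init__(self, gamma, state_size, action_size, policy, learning_rate):
        self.gamma = gamma
        self.policy = policy                      # policy[s] = probabilities over actions
        self.lr = learning_rate
        self.q_values = np.zeros((state_size, action_size))

    def single_step_update(self, state, action, reward, new_state, done):
        if done:
            # terminal: no future value, only the last reward
            delta_q = reward + 0.0 - self.q_values[state, action]
        else:
            # expected value of the new state under the policy: a dot product
            expected_next = np.dot(self.q_values[new_state], self.policy[new_state])
            delta_q = reward + self.gamma * expected_next - self.q_values[state, action]
        self.q_values[state, action] += self.lr * delta_q

    def get_action(self, state):
        # sample an action with the probabilities given by the policy in this state
        return np.random.choice(len(self.policy[state]), p=self.policy[state])
```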
I am in a new state and I sum over all possible actions in that new state, which in Python is done with a dot product between the Q values in the new state and the policy I have there: one is the probabilities of all the actions there, the other is the values of all the state-action pairs. And then I subtract the Q value of the old state and the old action. So you can see the parallel between them. And as always, since I want to do trajectories, I also added the function which says: give me an action following the policy probabilities if I give you a state. It is just a random choice over the action indices with the probabilities of the policy in that state. Okay. So this was just a bridge between value evaluation with V and value evaluation with Q; I thought it would be useful to have it as a first step. What are we going to solve today with model-free control? A gridworld, it is a gridworld again. So it is a square world in which you can move up, down, left, right. It is a bit different from the previous one, but basically it is the same. There are transitions: you can move left, right, up, down; you cannot move outside of the box, and if you try, you stay put. It is deterministic, in the sense that there is no random probability doing strange things: if you do an action, you always end up in the same spot, and if you try to go outside or do something impossible, you stay where you are. Now we added a cliff: there is a region where, if you go there, you fall down and you are immediately transported back to the start. The start is also a new thing: we always start every episode from a single point and try to move; if you fall off the cliff you go back to the start, and the episode ends only when you reach the end point. Okay, I am creating the world, which looks like this: a rectangular world, we start in the bottom-left corner and we end in the bottom-right corner. All move actions cost one, but if you fall from the cliff, you break your legs: it costs 100 and you start from the beginning. So do not fall from the cliff; that is the main message of today's lecture. Again, to make a parallel with last lecture, I created a class for the environment, because this is the general way our environments are written, like in an OpenAI Gym. It is a very, very simple one. You have an init function which just creates the environment: it asks for a world and it asks for the start and the end. The shape is basically this very simple shape, plus the start position and the end position as given in the beginning. You have a variable which remembers the current position, initialized to the start, and of course a variable which says whether the episode is done or not. You have a reset function, as always in these environments, which just puts you back at the start and sets done to false. And you have a step function which takes an action, takes the current state, and tries to compute a new position. As I said, all moves cost one, but if you would fall off the edge of the square you stay put, so the new state just goes back to the old state.
And if the world in that position is marked as a cliff, then you go back to the start and you get a very negative reward, which is the value I put in the cliff, minus 100. And it just saves the new position in memory. Okay. To begin with, as we said, we are not doing any control yet; we just want to do value evaluation. So what policy do we want to evaluate? I just put a policy which is the same in all places: whatever the state is, it always tries to do the same thing. At the end I changed it a bit, I apologize; this is the actual policy I am using. Let's just look at it: for each state, I move down with probability 0.2, up with probability 0.3, right with probability 0.45, left with probability 0.05. Okay, that is just the policy I chose. Then I create the environment, I set the gamma and the learning rate, and I run it (I always forget to start it; okay, it takes a bit). Okay, it has finished the episodes. We did the trajectories, 100 episodes. We initialized our evaluator; we go through 100 episodes; every time we start from the start, which is the bottom left; we take the state, we take the action (up, down, left, right correspond to the index of the action we took), and until the episode is done I take the new state, I get the reward, and I give the state, the action, the reward, and the new state to the evaluator, and so on and so forth. At the end of it all, the Q values have been accumulated inside the class, and I ask for them. So now in Q values I have the Q values evaluated for that single policy in that world. Now we can look at them. These are the Q values for being in a state and doing the action down. You can see that if you are just above the end point, here, the value is minus one, which makes perfect sense: if you are there and you go down, it costs one because you move, but then you have finished. That is the easiest one to understand. Then if you are close to the cliff and you go down, the numbers are extremely negative, because of course you will pay 100 for sure. The numbers are not perfectly converged yet, but it is clear that if you are there, doing a down action is a bad thing. If you do an up action, you can see that the values are negative but not dramatically so, because it means you essentially never have a probability of reaching the cliff. Going right, again, it is not that bad, and going left, also not that bad. Okay, so these are not converged, but this shows what happens in general: instead of having only one map of the value function, one value per state, you have as many maps as there are actions you can take. So if you want to know the value in a position, you actually have to compare all four action maps.
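For reference, a minimal sketch of such a cliff gridworld environment, following the OpenAI-Gym-like structure just described (attribute and method names, and the convention of marking cliff cells with -1, are my assumptions):

```python
import numpy as np

class GridWorldEnv:
    """Cliff gridworld sketch: deterministic moves, -1 per step, -100 for the cliff."""

    def __init__(self, world, start, end):
        self.world = world                  # 2D array; cliff cells marked with -1 (assumed)
        self.shape = world.shape
        self.start = np.array(start)
        self.end = np.array(end)
        self.position = self.start.copy()
        self.done = False

    def reset(self):
        self.position = self.start.copy()
        self.done = False
        return self.position

    def step(self, action):
        # action is a direction vector, e.g. (0, 1) for "right"
        new_position = self.position + np.array(action)
        reward = -1                                          # every move costs one
        if (new_position < 0).any() or (new_position >= self.shape).any():
            new_position = self.position                     # outside the box: stay put
        if self.world[tuple(new_position)] == -1:            # fell off the cliff
            reward = -100
            new_position = self.start.copy()
        self.position = new_position
        self.done = bool((self.position == self.end).all())
        return self.position, reward, self.done
```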
Now let's go to the new part: we are going to do temporal difference control, so we want to find the optimal policy. This is the most important thing. Why do we use the Q value instead of the V value? Because with the Q value it is very simple to find the best action. If I had the optimal value Q star of being in a state and doing an action, the optimal policy would be trivial: the optimal policy is just the policy which does the best action, so you take all the possible state-action pairs and, in each state, you take the action with the highest value, which is what is defined here. So the optimal policy is just the action that maximizes the state-action value. This is a very simple thing to ask for if you have the proper Q: I know I have four actions, which one has the largest value? Do that. This is why I want to use the Q value function. But of course we do not have the optimal value Q star, otherwise why would we even be talking: we have to find it, and we will find it iteratively. All the different algorithms work basically the same way: it is an iterative process of predicting Q and making a better policy, then a better prediction, a better evaluation of Q, a better policy, and so on. So iteratively: we make a random guess for Q at the beginning; we create a policy with some properties using that Q; then we create a new estimate of Q using that policy, and then a new policy with that Q, and so on and so forth. At the end, hopefully, our estimate of Q will be a good approximation of the optimal Q star, and then we can ask for the greedy policy, so the argmax, the best action to do according to that approximation, and that will be our approximation of the optimal policy. As I said, all the algorithms follow the same principle, but they differ slightly in how they estimate the Q star function. Why is this not a trivial task? Because of the delicate balance between exploration and exploitation, which is something you have also seen in class. The idea is that we want to exploit our knowledge to find a good action to do, but we also want to do the actions which we do not think are good, because we still want to gather more information about the world, on which, at the beginning, we have no information at all. So the policies we use to create trajectories need to balance two different drives. One is exploitation: using our current information to do the best action possible. In particular, exploitation means being greedy: if I have an estimate Q at time t and I am in state S_t, being greedy means doing what is the best thing to do given that estimate. Clearly this is not a stupid thing to do if you want to increase the future reward; you get good rewards. But the problem is that the current information could be wrong or incomplete, so in the end there is a very high chance that you are not acting optimally. Exploration is the other drive: you do actions even when you think they are poor, suboptimal, because they could lead to a larger information gain, and this could eventually lead to better rewards than experienced so far. Of course, if you already had perfect information, all you would do with exploration is waste time. There is a very simple way to understand this with ice cream, starting with exploitation.
So suppose you have never tasted any ice cream except watermelon and lemon, and you decide watermelon is better than lemon, and you say: from now on, until I die, I will only get watermelon ice cream. I do not care that there are many other flavors; I know I prefer watermelon to lemon, I do not want to make any mistake, I will only take watermelon. This is clearly a terrible idea. Exploration, on the other hand, is like: I know that I love chocolate, chocolate is the best thing ever, but since I am not entirely sure that chocolate is really the best, every time I go for an ice cream I never take chocolate and I take a different flavor. So in the end you will have tried nuts, melon, mint, strawberry, puffo, once even gorgonzola and nuts. But the problem is that at the end you may find out that you preferred chocolate all along. So you have to balance these two things. A simple way to balance them, which you saw and which is very common, is called epsilon-greedy: you take a probability epsilon, which should be rather small, and with probability one minus epsilon your action is greedy, meaning you do what you think is best with respect to the current information; but with probability epsilon you do pure exploration, so you take a random action among all the possible ones. So the epsilon-greedy policy in a state is: with probability one minus epsilon, the action which maximizes the Q value for that state; and with probability epsilon, any action at all. Epsilon-greedy policies are very well known and well studied. The problem is that with a constant epsilon, even at convergence, even when everything is known, you still have an epsilon probability of acting suboptimally, which is a lot; it means that your regret grows linearly with time. So it is not the best thing to do, but it is the best thing to start with. Okay, that said, let's go to the technical part of our first algorithm, which is called SARSA. SARSA is what is called an on-policy TD control method. It is very straightforward. We said that all these algorithms have two elements: one element is a way to update the estimate of the Q value from one step to the next, and the other is a way to construct a policy. The update of the current estimate will be done with the temporal difference error in Q: we will see how we define it, but we will have a temporal difference error in the Q value, and at each step the new estimate of Q is equal to the old estimate of Q plus a small fraction of this error, which we have just computed with one step. And then, using this estimate of Q, we construct an epsilon-greedy policy, which of course will be different at each step because Q changes at each step; and again, we use what we defined before as epsilon-greedy. So the two ingredients of SARSA are: the update of the evaluation is done with the temporal difference error in the Q value, and as a policy I just take epsilon-greedy. What does SARSA need to work at each step? It needs the state I started from, the action I took, the new reward I gained, the new state I ended up in, and also the next action I took there, because I want to compare state-action pairs.
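Written out, a sketch of the standard formulas for these two ingredients, the epsilon-greedy policy and the SARSA update:

```latex
% epsilon-greedy policy built from the current estimate Q_t
\pi_{\varepsilon}(a \mid s) =
\begin{cases}
\text{a greedy action } a \in \arg\max_{a'} Q_t(s,a') & \text{with probability } 1-\varepsilon,\\[4pt]
\text{a uniformly random action} & \text{with probability } \varepsilon .
\end{cases}

% SARSA temporal-difference error and update
\delta_t = R_{t+1} + \gamma\, Q_t(S_{t+1}, A_{t+1}) - Q_t(S_t, A_t),
\qquad
Q_{t+1}(S_t, A_t) = Q_t(S_t, A_t) + \alpha\, \delta_t .
```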
So compared to the normal TD(0) we saw last time, we also need the new action. And if you put S, A, R, S, A together, you get SARSA, which is where the name comes from. The update is computed with the temporal difference error, and it is just this one: delta Q is equal to the reward I got, plus the discounted Q value of the new state and the new action I actually took, minus the Q value of the state I was in and the action I took before. So, differently from before, before I can do this update I also need to take a new action: I need to be in the new state and take the new action there. This is similar to what we did in the V value evaluation with TD(0), which was R plus gamma V minus V, except that instead of the new state and old state we have the new state-action pair and the old state-action pair. Okay, so there is this parallel, and the update is just as written before. A second convergence caveat: before we said that a constant epsilon is suboptimal, especially at long times; with a constant alpha you will never converge, you will fluctuate around the true value forever. As you have seen in class, there are two criteria which ensure that the estimate goes to the true value with probability one in infinite time: the sum of the squares of the learning rates should converge, but the sum of the learning rates themselves should diverge. This is the pseudocode, which we will now translate into code very similar to what we did before for evaluation, but with one step further for the policy. You have some algorithm parameters, which are the learning rate and the epsilon: the epsilon for the policy and the learning rate for the update. You initialize Q; we initialize Q to zero, and we also remember that the Q value of the terminal state is zero and stays zero forever. Then you run episodes. In each episode you start from a state S, you choose an action, you take that action, and you get a reward and the new state. Then you choose an action again, from the new state, before doing the update, and only then you do the update, which is: you take the Q value and you add a fraction of delta Q to it. And then you continue: you have already chosen the next action, so you move to the next state, take that action, and so forth and so on. It may seem that there is only prediction here and no control, but no: the policy is also improved, because the epsilon-greedy policy is constructed using the current Q. So the prediction, the better estimation, is done in the line that updates Q, but the control, getting policies which are better and better, happens because the epsilon-greedy policy changes as a function of time, since it depends on the current estimate of Q. Good, so now let's go to the code. This is a very simple Python code; I created a SARSA control class. Again, I need a gamma; I need the size of the system, because I need to initialize the Q values; I need the learning rate; and I create a zero matrix of size states times actions. And then, as always, there is one main part of the algorithm, which is the single-step update. Now I need the state, the action, the reward, the new state, and the new action.
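A minimal sketch of that class (names are my assumptions, mirroring the evaluation class sketched earlier; the epsilon-greedy method with tie-breaking is the one discussed next):

```python
import numpy as np

class SarsaControl:
    """On-policy TD control (SARSA), sketch."""

    def __init__(self, gamma, state_size, action_size, learning_rate):
        self.gamma = gamma
        self.lr = learning_rate
        self.action_size = action_size
        self.q_values = np.zeros((state_size, action_size))

    def single_step_update(self, state, action, reward, new_state, new_action, done):
        if done:
            delta_q = reward + 0.0 - self.q_values[state, action]
        else:
            # bootstrap on the state-action pair actually visited next
            delta_q = (reward
                       + self.gamma * self.q_values[new_state, new_action]
                       - self.q_values[state, action])
        self.q_values[state, action] += self.lr * delta_q

    def get_action_epsilon_greedy(self, state, epsilon):
        if np.random.rand() < epsilon:
            return np.random.randint(self.action_size)        # explore: any action
        best = self.q_values[state].max()
        best_actions = np.flatnonzero(self.q_values[state] == best)
        return np.random.choice(best_actions)                  # break ties uniformly
```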
What it does is mostly the same thing as before, with a slight change: if the episode is done, then again, as in TD(0), I just take the reward, plus zero, minus the value of the state-action pair I was in. But if the episode is not done, I take R plus the discounted Q value; and this is not the Q value summed over all possible actions, no, it is the Q value of the new state and the new action, minus the Q value of the old state and old action. And then the update is essentially the same as before: I am just adding a small fraction of this delta Q error. If I want an epsilon-greedy policy, I add it as a function. What it does is: I ask for a random number between zero and one; if this number is smaller than epsilon, do whatever you want, so it creates a probability over actions which is flat over the whole action space. If not, then I first have to find the best value, which is the maximum of the Q values for that state. Now, since Python returns only the first maximizer, but we want to share among all the actions which are tied, I added a second part, a mask, which says that the best actions, which may be one or more, are all the actions whose value equals the best value. So I take the Q values in that state, which is an array over all actions, and check which ones are equal to the best: for sure one is, because I took the maximum, but there could be more, and the mask just says you are, you are not, and so on. Then the probability of each action is this mask converted to floats: zero if you do not have the best value, and one over the number of best actions if you do. So essentially, instead of asking for the single best action, I ask what the best value is and how many actions share it, and make the probability uniform among them. But the main point is: with probability epsilon you do anything; with probability one minus epsilon you do one of the best actions, uniformly if there is a tie. And then, using these probabilities, I take an action, okay. Then I can also ask: that was the epsilon-greedy policy, what is the greedy policy? The greedy policy is just the argmax of the Q values in that state. Notice that this is technically wrong, in the sense that I should again split the probability among ties; I do not care, because asking for the greedy policy here means I have finished learning, so I am fine with only one of the possible greedy policies. If I wanted to keep on learning, I would need some exploration, so the tie-splitting would actually be much better. So notice there is this slight inconsistency if you want, but the greedy policy is defined like this. Okay, let's do the control, and then maybe we take a break. We are doing 2,000 episodes, so we need a larger number of episodes. We initialize SARSA with some learning rate and some epsilon, and now we also keep track of the performance: for each episode, what is the final total reward of that episode? Then I run over the episodes, and as always I do the same thing, but now I first get the state, then the action; the action from SARSA is just an index 0, 1, 2, 3, and the corresponding action as a vector is the arrow of the direction.
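Putting the episode loop together, a sketch with assumed names (env is a GridWorldEnv and SarsaControl is the class sketched above; the action mapping and the numerical values are examples):

```python
# Note the S, A, R, S', A' order: the next action is chosen *before* the update is applied.

def to_index(pos, shape):
    return pos[0] * shape[1] + pos[1]           # flatten a grid position to a state index

action_map = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}   # assumed: left, right, up, down

n_states = env.shape[0] * env.shape[1]
sarsa = SarsaControl(gamma=1.0, state_size=n_states, action_size=4, learning_rate=0.1)
epsilon = 0.05
performance = []

for episode in range(2000):
    pos = env.reset()
    state = to_index(pos, env.shape)
    action = sarsa.get_action_epsilon_greedy(state, epsilon)
    total_reward = 0.0
    done = False
    while not done:
        new_pos, reward, done = env.step(action_map[action])
        new_state = to_index(new_pos, env.shape)
        total_reward += reward
        new_action = sarsa.get_action_epsilon_greedy(new_state, epsilon)
        sarsa.single_step_update(state, action, reward, new_state, new_action, done)
        state, action = new_state, new_action
    performance.append(total_reward)
```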
Okay, so there is a map between 0, 1, 2, 3 and move left, move right, move up, move down. While the episode is not done, the loop runs a whole episode: it takes the new state, the reward, and whether it is done; I save the reward in the performance; I get the new action, and only then I do the single-step update with the state, the action, the new reward, the new state, and the new action, and then around again. Okay, so first of all I need to start it (I always forget, I apologize); okay, it is calculating, good. Maybe 2000 episodes with Zoom on was not a brilliant idea, but no, okay, good. Now you can see again what happens: the Q values for the action down are like this, the Q values for the action up like this, then the Q values for right and for left. So again, you do not get a single map of values; you get as many maps as there are actions. But the most important thing I wanted to show you is this. This is the world, and above it the best Q for each state, the best value, so I am already taking the max over the actions; and here you can see the greedy actions. The actions say: if I start here, I first have to go up, then go right, and then go down. This is what SARSA learned. And now something should feel slightly off to you, because I said before that the movements here are deterministic: so if you decided to walk right along the edge of the cliff, you could not actually fall off. Instead SARSA does this thing where it first goes up, then right, and then comes down. Before the break, let's discuss why that is. The reason is that SARSA is an on-policy method: it follows the trajectory with a policy, and it evaluates the very same policy it is following. We are trying to get the best epsilon-greedy policy possible, but we are still evaluating that epsilon-greedy policy, and an epsilon-greedy policy has an epsilon chance of making a random choice. So what SARSA is really saying is that, in a world where you act epsilon-greedily, since you have an epsilon probability of doing things randomly, you should never walk close to the cliff: you should go up, right, and down. Why? Because I am learning the values of the policy I am following, and the policy I am following is actually dangerous close to the cliff. And here you can see that epsilon-greedy policies, while they are a good tool because they balance exploitation and exploration, are with SARSA a double-edged sword: you are optimizing for that kind of policy, and that kind of policy can lead to Q values which are not exactly what you want. So with that, perhaps we take 10 minutes of break, and then we see something which is different, an off-policy method, where the policy I am following and the policy I am evaluating are two different things, which is Q-learning. And then, if we have time, I hope we will do expected SARSA and a bit of convergence. Okay, I apologize, very well. Yes, exactly: right now we are using an epsilon which is finite and fixed. At the end we will see that we can take epsilon and change it in time, and we will see that it converges. So epsilon-greedy with a constant value is suboptimal, and it can be suboptimal in a very visible way, as you see here.
But clearly there is a way to change it in time, which again is proven to be optimal. Okay, let's stop. Stop, pause the recording. Please resume the recording. It's recording now. Perfect, many thanks. So, so far we have moved to the Q value; we saw how to evaluate the Q value given a fixed policy; we saw the first method, SARSA, which iteratively evaluates Q with a TD error delta Q and then chooses an epsilon-greedy policy to explore and exploit, and so on. And we saw that it is a good method: it learns, in a reasonable way, to go from the start at the bottom left to the end at the bottom right. With a fixed epsilon it is not optimal, but it is still a good solution; it is not doing random things. Now we go to the second algorithm in this very simple class, which is called Q-learning and which is off-policy. Again, it is very simple. It has a policy which you use to explore, which is the same as before, epsilon-greedy; and recall that epsilon-greedy means that with probability epsilon you take a random action among all possible ones, and with probability one minus epsilon you act greedily, so you take the best action according to the current evaluation of Q. So what is different? Many things are similar to SARSA: at every step t we store the old state S_t, the old action A_t, the new reward R_{t+1}, the new state... (We can't see your screen, by the way. Yes, that is simply because I forgot to share. Okay, thank you very much; fortunately I was just reiterating things.) So, we are now at Q-learning, and the idea is that, again, we explore using an epsilon-greedy policy which is the same as defined before, and many things are the same as in SARSA: at every step we have the old S_t, the old A_t, the new reward R_{t+1}, the new state S_{t+1}. But we do not need to take the new action, because what Q-learning uses in the TD error for the new state is the best action, the best action available in that state, virtually of course, given the current estimate. So instead of Q_t of S_{t+1} and A_{t+1}, the action it actually takes in the trajectory, it uses something which is separate from the trajectory: it considers the best possible value given the estimate Q_t. Okay. So the only difference with SARSA, and it is a large difference, is that the update of Q is done with this term here. Why is it off-policy, then? Because we have a clear separation between the trajectories, which are generated following epsilon-greedy, so in the trajectories, in an epsilon fraction of the actions, you do random things, you do not take the argmax over actions; while in the evaluation, the value of the new state is not computed with the action probabilities of the policy you follow, it is computed using the argmax. So in a sense it evaluates something which is closer to a greedy policy, while exploring with an epsilon-greedy one. There is a shift between what is evaluated and what is used for exploration: the two policies are not the same. Okay, this is what off-policy means. The algorithm parameters are as before: a small learning rate and a small epsilon. We have this here: you initialize your state and you take an action from the policy, which is epsilon-greedy.
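Written out, a sketch of the standard Q-learning update, for comparison with the SARSA one above:

```latex
% Q-learning temporal-difference error and update:
% the bootstrap uses the best action in the new state, not the action actually taken
\delta_t = R_{t+1} + \gamma \max_{a} Q_t(S_{t+1}, a) - Q_t(S_t, A_t),
\qquad
Q_{t+1}(S_t, A_t) = Q_t(S_t, A_t) + \alpha\, \delta_t .
```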
So the epsilon-greedy policy is the one you actually use to explore. Then you take the action, you observe a new reward and a new state, but the update is not done with the probabilities given by the policy you are actually following: it is done using this greedy choice of action. So the value is updated with the max over actions, which is a greedy conception of the policy you are evaluating. The policy you are evaluating and the policy you are following are two different things. And then you do this over and over, for many episodes, until the end. Why is it off-policy? Again, I already said it, but it is very important: before, in SARSA, you were doing the update of Q using the new trajectory pair, S_{t+1} and A_{t+1}, so you were evaluating the policy you were following. Here you are evaluating a policy which is not exactly the one you are following: you are following an epsilon-greedy policy and you are essentially evaluating a greedy one. (Now the host is me again; good to know.) Again, the structure of the algorithm is basically the same as before: it needs the gamma, the size of the space, and the learning rate; it needs to initialize the values, and I initialize them to zero. They could be initialized to other things; there is a whole discussion about optimistic initialization, et cetera; I do the most basic thing and initialize everything to zero. The single-step update is again extremely similar to SARSA, but I do not need the new A, so I do not require the new action as input. What I do is: if the episode has ended, I just take the reward plus zero minus the Q value of the state-action pair. If the episode has not ended, I have to find the best Q value available at the new state: I go to the new state, I look at the Q values there, which is an array over all actions, and I take the best. I do not take the one I will actually follow in the trajectory; I take the best. Then the delta Q is just the reward plus gamma times this best over the available new actions, minus the old Q value, and the update is exactly the same as before. So you see the structure is the same as before: these two lines of code change it from being on-policy SARSA to off-policy Q-learning. Again, if you want an epsilon-greedy policy, you ask for it given a state; this is the same as before: with probability epsilon do whatever you want randomly, with probability one minus epsilon find the best value, find all the actions which share the best value, and put a flat probability on the actions which tie for the best value; then make a random choice with this distribution, which is either completely flat or flat only over the best actions. Again, I also define for convenience the greedy policy, meaning that once you have the Q values and you just want to know what you have arrived at so far as a greedy policy, you can get it. And then I have to run it, otherwise I will ask for it many times and get it wrong. What I do now is the same as before: I run Q-learning control for 2000 episodes; I initialize this new algorithm; I keep track of the performance; I run over the episodes; every time I reset the environment, I start from the current state, I take an action, and so on, and until the episode is finished I just take the new state and the new reward from the environment step.
I keep track of the performance, take the new action, look up which direction the new action corresponds to, left, right, up, down, and do the single-step update. Now it is not SARSA any more but, if you like, SARS, because it does not require the new action from the trajectory: it takes the maximum over all the possible actions instead. Okay. Then again I ask what Q values it arrived at, and you can see that this is actually similar to before, but with two main differences. The first one you can already see visually: there seems to be much less variance here. Before, there was more variance; it did not seem to be a clean gradient everywhere; here it seems much better behaved. And you can also see something which is very nice: it has already learned the optimal policy. You can also see that the value along the optimal policy is just minus one, minus two, minus three. The values have not converged perfectly, but the policy has, and in any case, at least along some of the trajectories, the value correctly describes the distance to the end point. Okay. Now, something which can be counter-intuitive, and which arises exactly from the difference between on-policy and off-policy: remember, SARSA did not find the optimal path, but it was on-policy, so it was evaluating the policy it was following; Q-learning found the optimal trajectory, but it was actually following a different policy. (What have I done? Okay, I hope you are not hearing what I am hearing.) Okay. So what is this? This is just what I stored: the performance over time, so orange is SARSA and blue is Q-learning. You can see two things. SARSA is much more consistently near a high value, but it never reaches the highest value, which Q-learning does; and Q-learning actually has a much worse performance during the run. Why? Because both of them are following epsilon-greedy policies, so both of them sometimes take random actions, and sometimes those random actions send them into the cliff. SARSA has learned to optimize an epsilon-greedy policy, so it has learned to minimize how much those random actions along the cliff hurt it during the trajectories. Q-learning has used those same trajectories, but used them to find the best policy, which is separate from the one it follows. So Q-learning has a much worse online performance, because it follows trajectories that hug the cliff, but it does not care, because it knows that if you were not taking random actions you could walk along the cliff without falling: it has actually learned to go along the cliff. In a sense, the final performance of Q-learning is better, because with probability one minus epsilon it has found a better, optimal policy; but since both of them generate trajectories with an epsilon-greedy policy, SARSA does better during learning, because it has optimized itself for the fact that there is an epsilon probability of doing random things. This is just to show the main difference between them: SARSA, on-policy; Q-learning, off-policy. Okay, let's do one more step, to expected SARSA. Again a new name, but really two lines of code. Expected SARSA is a very simple modification of SARSA.
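The modification is in the bootstrap term of the TD error; as a sketch, this is the formula the next part walks through:

```latex
% expected SARSA: bootstrap on the expectation over the policy actually followed
\delta_t = R_{t+1}
  + \gamma \sum_{a} \pi(a \mid S_{t+1})\, Q_t(S_{t+1}, a)
  - Q_t(S_t, A_t),
\qquad
Q_{t+1}(S_t, A_t) = Q_t(S_t, A_t) + \alpha\, \delta_t .
```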
It is helpful because it reduces some of the variance of SARSA: SARSA, if you think about it, needs actual state-action pairs every time, so if the action space is very large, when you are in a new state you would need too many trajectories to sample all the possible actions there. A way around that is to say: wait, when I am in the new state S_{t+1}, I am not asking where my trajectory actually goes next, and I am not trying, like Q-learning, to find the best thing I could do; instead I take the expected Q value under the policy I am actually following. It sounds a bit complicated, but the only difference with SARSA and with Q-learning is that the temporal difference error is computed like this. It has the reward, that is a must, everybody shares the reward, minus the Q value of the past state-action pair, that is also a must. Then, in the middle, it does not have the Q value of the pair S_{t+1}, A_{t+1}, which is SARSA, meaning I look only at my one trajectory; and it does not have the max over a of Q(S_{t+1}, a), which would mean hypothetically doing the best thing, which is what Q-learning does. Instead it says: I know that I will follow my policy, and I know my policy exactly, it is epsilon-greedy, so I know exactly the probability of each action; so I sum, over the actions, the probability of doing that action under the policy I am actually following, times the Q value of being in the new state and doing that action. So SARSA has one trajectory, one new action, and it uses that both in the real world, as a trajectory, and in the update. Q-learning follows a trajectory with one action from one policy, but it does not care what that action was: it does the update with the best action it could have taken. And expected SARSA does a bit of both: it says, in the new state I consider all the possible actions I could take, weighted by the probability of taking them under my policy. That is what is done here; this is why it is called expected SARSA, because it is an expected value, the average of the Q values of the new state over the actions, weighted by my probability of taking each action in that state. Okay, and then, once you have this number, the update is done the same way as before. So this new expected SARSA control class is like all the other classes: first the same initialization, and then the only part which is different is this. If the episode is done, I have the reward plus zero minus the Q value of the state-action pair. If the episode is not done, I do R plus the discounted inner product, so the sum over actions of the product of the Q values for all the actions in that state and the policy probabilities for all the actions in that state. Now, the only other thing which is a bit different is that I do not only need a get-action epsilon-greedy function, which returns one action with the probabilities given by the epsilon-greedy policy; I also need a function which returns the actual probabilities. I could have computed them in the same place, since I already calculate them, but I needed them separately. So what the policy function does is take the current Q value estimates and return the probabilities of the policy in that state given those estimates. First of all, I know that with probability epsilon I will have a flat probability.
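Putting those two pieces together, a minimal sketch (names assumed, following the same structure as the classes above):

```python
import numpy as np

class ExpectedSarsaControl:
    """Expected SARSA: bootstrap on the expectation over the epsilon-greedy policy (sketch)."""

    def __init__(self, gamma, state_size, action_size, learning_rate, epsilon):
        self.gamma = gamma
        self.lr = learning_rate
        self.epsilon = epsilon
        self.action_size = action_size
        self.q_values = np.zeros((state_size, action_size))

    def policy(self, state):
        # epsilon part: flat probability over all actions
        probs = np.ones(self.action_size) / self.action_size * self.epsilon
        # (1 - epsilon) part: uniform over the actions tied for the best value
        best = self.q_values[state].max()
        best_mask = (self.q_values[state] == best)
        probs += (1.0 - self.epsilon) * best_mask / best_mask.sum()
        return probs

    def single_step_update(self, state, action, reward, new_state, done):
        if done:
            delta_q = reward - self.q_values[state, action]
        else:
            expected_next = np.dot(self.q_values[new_state], self.policy(new_state))
            delta_q = reward + self.gamma * expected_next - self.q_values[state, action]
        self.q_values[state, action] += self.lr * delta_q
```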
So, first of all, I take an array of ones, I divide by the action size, so this is a flat probability, and I multiply it by epsilon: this is a small, completely flat epsilon probability. And then, as before, I take the best value and the best actions; say I have two equivalent best actions, those get weight one half each, so I have zero, half, zero, half; I multiply them by one minus epsilon, and I sum the two pieces. So I sum this small uniform probability epsilon with some peaks of one minus epsilon divided by how many equally good best choices I have, and this is my overall probability distribution, considering everything. This is the probability distribution of the policy, which I can compute given the state, the current Q estimate, and epsilon, and which goes here. So this is truly the expected value given that I am in the new state: I know the Q values for the state-action pairs, and I know the probability of doing each action in that state given that epsilon. You see, it is essentially the same as before, but now we have three different ways of doing the same thing. SARSA: I am following a trajectory, I know what I did, and I do the update with what I did. Q-learning: I follow the trajectory, but I do not care what I did; I know what the best thing to do there would have been, and I use that for the update. And expected SARSA: I have one trajectory, only one, because I am only one robot, I do not have many parallel selves; but I know that, if I am there, I could have done all these actions with these probabilities, so I average them out for my update of the Q value. How do they perform against each other? In this particular case, not that differently, I apologize. So we have an expected SARSA control: I do 2000 episodes, I create this class, I take the same epsilon as before, I keep track of the performance, and I do the run of the robot. Everything is the same, actually everything is the same as in Q-learning, because I do not even need the A prime: the single-step update requires only state, action, reward, new state. What it requires in addition is the epsilon. Why? Because inside the update I need to build my policy, and to build my policy I need to know what epsilon is. Then I run it; wait, I have to run everything from the beginning, hopefully it will not take too long. Okay, again. So this is something which is in between the two: it is on-policy, but it should have less variance than SARSA, because with SARSA you have only one trajectory and you have to wait to accumulate all the possible trajectories, while here you say: I know what the expected value of where I would go is, and I use that. However, you see that the actual performance is perhaps a bit better than SARSA, but it is basically the same, and you can also see that it does not reach the same height as Q-learning. So again it has learned a suboptimal path, for the same reason as SARSA: it is on-policy, it is actually using for the update the policy it is following. Good. So far we have learned control in the most basic setting, for tabular systems, with temporal difference methods and without a model. Yes, I had a comment: in the book of Sutton and Barto you can actually see worked examples where you can appreciate the performance of expected SARSA, which is better than SARSA, especially with respect to the learning rate alpha, which is something that has not been varied here.
So for a reference on the advantages of expected SARSA over SARSA, you are referred to the book. Thank you. I guess you are muted. Yes, I will say that I think this is what you were referring to: this is alpha, the learning rate, and you can see that with alpha very small they are rather similar in performance, while with alpha very large SARSA is thrown off quite a bit and expected SARSA does much better. So that was the better point; perhaps I did not say it well, but they are not perfectly equivalent, of course; it is just that in this particular case it is not easy to spot the difference. Okay, the last thing I want to deal with, in a very basic way; you can read it in two ways: either I was lazy in preparing the lecture, or I wanted to give you many exercises to do at home. I want to briefly touch on the problem of convergence. So far we used two constant values. We used a constant learning rate alpha, which is bad for a very simple reason: the values will not converge, which is not what we want; instead they will oscillate forever around the target. You can also see it here: it took about 250 episodes to obtain basically the same performance that it then keeps for the rest of the 2000 episodes. So after a while (this could also be related to epsilon, it does not matter) the convergence is not that great. On the other hand, we know a solution: implement some schedule for alpha such that the sum of the squares converges and the sum diverges. The problem is that this is a mathematical statement about the limit to infinity, and limits to infinity have the problem of being at infinity. Generally speaking you have to find a balance: something which, yes, in theory converges, but which does not shrink too fast, and at the same time does not stay too large. One way of doing that, which is not the best but is sometimes done and which I suggest here, is to use a constant alpha up to some time T-star and then a function which decreases with some power of time, so it is constant up to T-star and then starts to decrease, and the rate at which it decreases is tunable with this K. This just means: it is true that this is a proven fact, but if you take an arbitrary learning rate schedule (these are sometimes called learning rate schedulers) and you try to make it converge within your lifetime, sometimes it does not work. Second thing: epsilon-greedy policies with constant epsilon are bad, because you keep doing suboptimal things, and we have seen with our own eyes that SARSA then learns the wrong path, not the shortest path from start to end. Again, you have seen in class that one possibility is simply that epsilon goes down with time.
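As an illustration of that kind of schedule, a sketch (the constants here are placeholders, not necessarily the ones used in the run below):

```python
def decayed_value(start_value, t, t_start, k=0.003, power=0.75):
    """Keep the value constant up to t_start, then decay it with a power of time.

    For the learning rate, a tail decay with power in (0.5, 1] gives
    sum(alpha_t) = infinity and sum(alpha_t**2) < infinity.
    """
    if t <= t_start:
        return start_value
    return start_value / (1.0 + k * (t - t_start)) ** power

# usage sketch: both alpha and epsilon can be decayed with the same recipe
alpha_t = decayed_value(0.1, t=5000, t_start=1000)
epsilon_t = decayed_value(0.05, t=5000, t_start=1000)
```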
Another possibility, which I think is the one I was referring to and which is more interesting, is that you can count every time you have visited a state-action pair. In this case it is feasible, because I just add a count array of dimension state space times action space, which is rather small, it is 4 by 12 by 4 again, and I add one every time I visit that particular pair S, A. Then I have a number which indicates how often I have visited that particular state-action pair, and the idea is that if I have not visited it, I want to be as exploratory as possible, because I am entering a new room and perhaps I want to know what I am going to find; but if I have, well, you can open your own fridge many, many times, it will not change what is inside, while if you want to find something in a new supermarket, every fridge is a new story. Okay, so this basically counts how many times you have done exactly the same thing, and epsilon, the probability of doing random things, diminishes with the number of times you have visited that particular pair. You can clearly see the problem: if you have a huge state and action space, you will have visited everything so few times that this approach is not great. But then again, maybe no approach is perfect. Okay, so I am proposing, because this is more or less what I will use below, some kind of formula which is just this here: decay, after a time t-start, with some power. Okay, let's do it. Same as before, but now first of all I initialize a starting epsilon, which is the one I start with and which will then decay, and a starting learning rate. Then I initialize the expected SARSA class with that starting learning rate, I add a t-start, so a time at which my decay begins, and I add a counter. During the run I add one at every step, so every single step counts one. And when the count has exceeded t-start, I change the learning rate, which is an internal variable of the expected SARSA class, to the original value divided by one plus 0.003 times the count, to the power 0.75. Epsilon does the same, and epsilon is then passed to the expected SARSA. So effectively we will see the results with both the learning rate and epsilon decaying with this kind of power law after a certain decay time. What do we see, after a couple of seconds of staring at a blank screen? Okay, perfect. We have, again, our usual values. The values say: please do not go down; you can go up, but it is not a perfectly sensible thing; going right is a good idea; and going left is not a good idea. And then what you see is the performance of Q-learning and of this expected SARSA during the simulation. You see that after some learning, in which the performance per episode goes from around minus 40, a rather long trip, up to the best value, it then stays at the best value, and sometimes it still does strange things, because hopefully our epsilon never becomes so small that nothing strange can ever happen: the sum of these rates, again, should diverge, so eventually, by pure numbers, something strange will happen. This is why at least once it does something crazy, but most of the time it does the right thing. So it should essentially have converged, even if we did not go to infinity. And we can see, perhaps, that this is the best one.
Is the comparison done with Q-learning with fixed epsilon and alpha? Yes, yes; sorry, I just took one which I had already run. So this is what happens with Q-learning where epsilon is not decaying; this was the old Q-learning, okay? That is the blue curve, constant epsilon and constant learning rate: it does find the optimal path, but it follows a policy which is rather poor, because it keeps a large epsilon. The other curve is expected SARSA with the learning rate and epsilon going to zero, which means it should actually have converged both in values and in policy. And you can see, for example here, that it does find the optimal policy, whereas expected SARSA with a constant epsilon of 0.05 did not; now it has. Also, at least for the points along the main road, the optimal trajectory, the values are very close to what we expect: it is zero here, minus one here because it takes one action to reach the end, then two, three, four, five point zero one because it still has not perfectly converged, but it is much better converged, then six, seven, eight, nine, 10, 11, 12, 13. So the Q values have converged much better than, for example, Q-learning; though Q-learning was not bad either: Q-learning was zero, one, two, then five, six, seven, eight, seven, eight, nine, 12, 13, 14, 13, okay, it still had some errors. That was Q-learning, which learned the optimal policy but was still fluctuating quite a bit in the Q values. Now, with the learning rate going to zero and epsilon going to zero, this has effectively converged both in policy and in Q values. Okay, with this we have the basic structure of what it means to do TD control for model-free systems in the tabular case, which is the most basic setting, but still, we have dealt with unknown models. What do I mean by unknown models? I mean that these very simple algorithms we have here, and I mean the code, the classes: if, instead of the place where I defined the environment to be the gridworld environment, possibly two hours ago, I put any environment, anything which has the same basic structure, where you can ask it to do a step given an action and you get back a new state, a new reward, a done flag, et cetera, this should work in the same way. This is an algorithm which does not require any information about the model; it requires only information from the trajectories. Of course it works only if the system is tabular, in the sense that there is a discrete action space and a discrete state space, but there is nothing specifically tuned to the fact that this is the gridworld with a cliff. You have trajectories, you feed them into the algorithm, and it will work; perhaps it will be too simple to achieve convergence in some cases, but it should work. This is what model-free means, okay? Do you have questions? Can I ask a question about the last plot? Sure, the last plot, you mean this or this? No, the next one. This one? Yes. Why, for example, in the upper-left part are there some actions that look strange, that go up? Okay, this is probably again a problem of convergence. Look at the best possible trajectory: it goes up and then all the way right, okay? Now let's think about a trajectory which actually reaches the top-left corner.
Do you have questions? Can I ask a question about the last plot? Sure, do you mean this one or this one? The next one. This one? Yes. Why, for example, in the upper-left part are there some actions that look strange, that go up? Okay, again, this is probably a matter of convergence. Look at the best trajectory possible: it is go up and then all the way right, okay? Now let's think of a trajectory that actually reaches the top-left corner. First of all, you have to go up, which is what you will do; even if epsilon goes to zero, you will eventually get up there. Then the greedy action says to go right. Do you want to go up instead? Then you have to wait for an epsilon-probability action that goes up. Okay, maybe you do it: with probability epsilon you go there. Then the epsilon-greedy policy again says go right, and if you want to go up you again have to wait for an epsilon-probability chance, okay? And when you finally arrive there, you do something, and you collect a very small amount of information about what you have done. What does this mean? It means that the update frequency in this top-left corner is much smaller than anywhere else. So, with the little information it has gathered in this top-left corner, visited perhaps very few times, the Q value of going up happens to be the largest of the four, okay? And in a sense it is not that surprising: of the four actions, going right is clearly the best, perhaps going down, while going left or going up just wastes one turn, because you stay where you are. So it found an action which is not optimal, but it is not even that wrong; it is not falling off the cliff. And of course it is an action visited so few times that its value has not yet converged to the true one. Again, if you wait an infinite time and it is still the wrong action, then you can go to Professor Celani and say, look, I waited an infinite time and this is still the wrong action, so the theory is wrong, okay? But the suggestion from Antonio was exactly right: if you compute the frequency with which you visited those states and took those actions, you will probably see that those parts are by far the least visited of all.

So once we reach the top-left corner we try to explore more, and this arrow appears because we did not reach the top-left corner enough times to tell which is the best action to pick? Okay, thanks. That is my current intuition, yes. There could of course be problems in the code or anywhere else, but looking at it, I do not find it strange, and I think that is exactly the reason: we have not reached the top-left corner sufficiently many times for it to converge and discover that there is actually a better action. Okay, thanks.

Okay, may I come in here and say a few things? The first thing is that I accept your challenge: if anyone finds out that there is a problem, I am happy to discuss it. The second thing, and by now you must have discovered the pattern, is that every question comes with an exercise. One way to get around this problem is, rather than always starting from the lower-left corner, to start from random places in your grid, so that sometimes you also start from the top-right or top-left corner. These random restarts force the exploration, and the convergence in terms of the number of visits will be much better in that case. Do we agree, Amalaya? Good. I am done here.
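A minimal sketch of the random-restart idea just suggested, assuming the environment can be told which cell to start from; the method names (reset_to, act, update) are hypothetical and only illustrate the structure of the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_start_episode(env, agent):
    """Run one episode from a uniformly random starting state (exploring
    starts), so that rarely visited corners also receive Q updates.
    In practice one would exclude the cliff and the goal cell."""
    state = int(rng.integers(env.n_states))
    env.reset_to(state)            # hypothetical: place the agent in `state`
    done = False
    while not done:
        action = agent.act(state)                   # e.g. epsilon-greedy on Q
        next_state, reward, done = env.step(action)
        agent.update(state, action, reward, next_state, done)
        state = next_state
```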
So in SARSA the policy is given? It is as if, in the environment, you already know the policy and cannot do anything about it, whereas in Q-learning you are switching between policies because you do not know which one is right? Thanks for the question, let me clarify something. In SARSA the policy is not known in advance; you can see it in the code. The policy is a function of the Q you have at that moment. For example, here you see this epsilon-greedy step: at a certain time t, it depends on the Q values you have at that moment. So the policy is not fixed from the beginning; it is not fixed from the start. At the beginning the Q values have some values, I start with zeros, they could be random, and the policy will be strange. Then, as time goes on, the policy is always different, but it always follows this formula here. So the policy adapts to the Q values; it is not fixed. If it were fixed from beginning to end, this would not be control; it could never learn the optimal policy. But it does not change in an arbitrary way: it changes by always applying the same rule, which is, look at the Q values you keep accumulating, changing, and storing, and from those Q values build the epsilon-greedy policy. So both things change from step to step. In the pseudocode, the Q value changes because you have visited a new state and you have this formula here, which says, okay, I thought my Q value was two, now it is 2.01; these are the changing Q values. But since the Q value has changed, the policy constructed on top of the Q values has changed too. So in SARSA, Q values and policy change together, all the time; nothing is fixed from beginning to end. Perhaps I made it sound like the policy was fixed. What is fixed is how you get the policy from the Q values: given the Q values, you get the policy in this way here; but since these Q values change all the time, the policy changes all the time as well. This is the first thing, which I hope is clear. Is it clear that it is not a fixed policy, but that the policy changes because the Q values change? I hope so.

The second thing is what is actually different between SARSA and Q-learning. In SARSA, the Q values change all the time, the policy changes with the Q values, and the policy is used both for generating the trajectories and for the update. So the Q values change, the Q values change the policy, the policy creates the trajectory, the trajectory changes the Q values, the Q values change the policy, the policy makes the trajectory, and so on. This is SARSA, and this is why it is called an on-policy method: the same policy is used to generate the trajectory and to evaluate it. Q-learning has exactly the same structure, but with a small twist. Yes, the Q values change in time, and the policy used for the trajectory, the same as in SARSA, changes in time because it is a function of the Q values; again, you do not have a fixed policy, everything changes. The policy builds the trajectory, but the update is not done using exactly the trajectory: it uses a bit of the trajectory, where I was and what I got, and a bit of extrapolation, in the sense that instead of the action I actually took, I use for the update the best action I could have taken. But still, you started with zero knowledge of the Q values and the worst policy ever, and as information accumulates, the Q values get better and the policy you are following changes. Again, it is not fixed in time.
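A minimal sketch of the two points just made, with illustrative names rather than the lecture's exact code: the behaviour policy is recomputed from the current Q table at every step, and the one-step target is the only place where SARSA and Q-learning differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_action(Q, state, epsilon):
    """The policy is just a function of the current Q table, so it changes
    whenever Q changes; only this rule itself is fixed."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))    # explore
    return int(np.argmax(Q[state]))             # exploit the current estimate

def one_step_target(Q, reward, next_state, next_action, gamma, done, method):
    """Bootstrap target for the Q update; the sole SARSA/Q-learning difference."""
    if done:
        return reward
    if method == "sarsa":        # on-policy: the action actually taken next
        return reward + gamma * Q[next_state, next_action]
    if method == "q_learning":   # off-policy: the best action, whatever was taken
        return reward + gamma * np.max(Q[next_state])
    raise ValueError(method)

# In both cases the update is then:
#   Q[state, action] += alpha * (one_step_target(...) - Q[state, action])
```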
Was this clearer? Yeah. So there should come a time when the trajectories of SARSA and Q-learning are the same? Yes. If you let epsilon become smaller and smaller and the learning rate become smaller and smaller, so that you are in the conditions to reach convergence, then the optimal policy for one and the optimal policy for the other, since the optimal policy is a property of the system and not of the method, should converge to the same thing. And you can more or less see it now, although it is not super easy. Again, how can I show it? Wait. Okay, this was Q-learning, and this was expected SARSA with... Okay. Oh, I hate you.

May I add one thing while you search for the plot? Yes, sure. Grid world is a bit tricky in this respect, because as the trajectory walks in right angles, there might be more than one trajectory that is optimal. It depends on whether you take the turn first and then go down, or you go down first and then go left: from the same point there might be more than one optimal path, and this is reflected in what you observe as well. So one has to be really careful when there is a degeneracy of optimal paths connecting your start point to the end point. Which happens whenever you are not starting from the bottom-left corner. Yes, if you start from any other point further up. Yes.

Something one could do as an exercise, if I want to follow Antonio's example of turning every question into an exercise: as I pointed out, in my code the greedy policy, which is what is plotted here, is computed by taking the best action with an argmax, which means that even if there were ties I pick only one. You could instead run to convergence, detect the ties, and plot all the tied actions together. You would then see that what here looks like two slightly different policies are in fact much, much closer, because, as Antonio was pointing out, if you are anywhere not on the cliff, going right or going down should be exactly the same: in both cases you are getting closer to the endpoint, so there should be a tie. Right now my code does not show both of them when there is a tie in the greedy policy. But you can see that they have converged, at least in most of the states, and in all of the states composing the optimal trajectory from the bottom left to the bottom right they have converged to exactly the same thing: this here and this here.
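A minimal sketch of the tie-plotting exercise just described: instead of a single argmax, collect every action whose Q value is numerically maximal, so that degenerate optimal actions can all be drawn as arrows. The tolerance and names are illustrative.

```python
import numpy as np

def greedy_actions_with_ties(Q, state, tol=1e-8):
    """Return all actions whose Q value is within `tol` of the maximum,
    instead of the single index that np.argmax would report."""
    q = Q[state]
    return np.flatnonzero(q >= q.max() - tol)

# Example: with Q[state] = [-3.0, -3.0, -5.0, -4.0], this returns
# array([0, 1]), whereas np.argmax(Q[state]) would return only 0.
```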
Are there more questions? It is a bit late; if you have one, please do, but otherwise, yes. In the expected SARSA, the spectrum of policies is... Let me come in there. In expected SARSA you have a spectrum of policies? What do you mean by a spectrum of policies? Because in SARSA, the policy is just how you update Q. Oh, wait: by policy I mean the action to take given a certain state. So to me the policy is the arrows here: it means that if I am in this position, the policy is the probability of taking, say, the action right rather than the action left. So what I show here is the optimal policy as converged for the two cases, expected SARSA on the left and Q-learning on the right, which means I am showing the action to take if you are in that state. If by policy you mean the rule you use for the update, I do not think that is generally called a policy. The only difference between Q-learning and SARSA is in how you define the delta Q, that is, the update of the Q value. The values themselves should converge: the Q values of every state-action pair, for each state and each action, under the convergence criteria and with infinite time, will converge to the same numbers for both methods. Again, as you can see here, in the second plot from the top, they are actually not that far off: if you are right next to the end, you can see that there is a zero, and it is zero in both cases; then there is a one, which is good because it means it takes only one action to get there, so the value is probably minus one; then a two if you are at distance two; and something close to three in most of the places at distance three. It is not at convergence yet, but the two methods will converge, with probability one, given enough time with decaying alpha and decaying epsilon, to the same values. They have not converged yet simply because of finite time, and so on. Is that at least part of the answer to your question?

In any case, if you want to continue, please write to me, any of you; I will try to answer mail and so on as quickly as possible. And sometimes it is very useful if you ask a specific question rather than a broad one: for me, because then I know exactly which point was unclear, and for you, because instead of spending two hours re-listening to the lecture you may get a two-minute answer, one to one. We went late, so I apologize to everybody. As always, if you have questions, write, ask, anything. Thank you very much. I will stop recording; I should have done it already.