Excellent, very good. Welcome back, everybody. We resume from the previous lecture. Today we'll go through a quick summary of what we achieved yesterday, plus some more considerations on how to perform exploration in a reasonably effective way, mostly focusing on bandits as our test-bed example. Then in the second part we move back to the full reinforcement learning problem, and we will discuss a few algorithms that can be adopted to learn, without a model, how to control an MDP.

But first things first, let's go back to our summary of results from yesterday. What we have seen is that we can implement the general idea of control in the absence of a model by setting up an algorithm that works as follows. It's a loop in which we start out with some estimate of our Q function. At the beginning it could be a blank slate, values set to zero or set at random. Of course, the point you start from definitely matters for performance, in the sense that the closer you are to the optimal solution, the quicker your learning will be; but you start from some guess anyway, which may or may not encode your prior knowledge about the system.

The first and most important step is that, starting from that knowledge, you derive a policy. This is a policy derived from your estimate of the Q function, and here there are several choices, some of which we will discuss today. The main idea is that if you had perfect knowledge of the optimal value function, you would choose the policy greedily: you would pick the action that maximizes your expected return. But the point is that this is just an estimate, so you're not allowed to do that. If you do, you fall into traps: your experience might lead you to believe that you are choosing the best action when this is not in fact the case, and if you go full throttle for the option that seems to be the best, you may neglect other options that might turn out to be better; if you never visit them, you will never know. So this mapping from the Q-function estimate to a policy at each time step must carefully balance exploration and exploitation, and there will be more on that soon.

Then we sample our system: we produce an action, and as a result of this action we observe a reward and, more generally, a new state. If it's a bandit you only observe rewards, because the state doesn't matter, but in general you also observe a new state. So there is a step where you select an action and a step where you observe. From these observations you construct your temporal difference error, which is a measure of the difference between what you have observed and what you would have expected based on your current estimate. Given all the quantities from the previous steps, the estimate, possibly the policy, and so on, you can compute the temporal difference error. Then you use the temporal difference error to update your estimate, and you could do that using some kind of eligibility trace: this is the part which involves some credit assignment.
That is, you have to give credit to, or blame, the states and actions that you think are responsible for the discrepancy between what you observed and what you predicted, which is encoded in the value of the temporal difference error. In the most simple situation, you solve this credit assignment problem in the simplest of ways: you give credit only to the state and action you have just visited, that is, to what happened in the very recent past. Of course, as you've seen, you can do better, and how much better depends on the details and properties of the underlying environment, so it's not easy to say a priori in general.

All right, very good. So that's the general scheme of the algorithm we will be playing with. This loop goes on and on in time, and assuming that you appropriately reduce the learning rate in the update step, which is where learning takes place, and the exploration rate in the policy step, which is where exploration takes place, eventually the cycle converges to the point where your temporal difference learning converges to the actual value of the policy, and the policy converges to the greedy policy for the optimal value. You have to guide your system slowly towards this point.

That said, we looked in detail at what happens if you are working with a bandit. This is a recap of bandits, which is the simplest situation, in which you don't have any states. We saw that learning the Q of a given policy is a very straightforward process. If alpha is constant, the algorithm simply computes a recency-weighted average of your rewards, one for each arm. Remember that the update reads: the new estimate is the previous one, plus alpha times the difference between the obtained reward and the previous estimate, Q ← Q + alpha (R − Q). So it's a self-correcting update, which amounts to saying that you are computing a geometrically recency-weighted average. In particular, if you choose alpha to depend on the arm you choose, going like one over the number of times you've chosen that arm so far plus one, then your estimates are nothing but the empirical averages for each arm. So in this very simple case, doing temporal difference learning coincides with just keeping track of the empirical average for each action you take: you pull arm one, you record that you've pulled arm one, say, N_1(t) times up to time t and collected a certain set of rewards, and your estimate is the empirical average, the sum of all the rewards collected for that arm divided by the number of times you visited it. That is exactly your estimate of the value of that arm, and you do this in parallel for all the arms you play. And, like I told you earlier, the greedy strategy based on this estimate fails.
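(As a side note, here is a minimal Python sketch of the two flavors of the update just described; the class name and interface are purely illustrative, not something from the lecture.)

```python
import numpy as np

class BanditValueEstimator:
    """Tracks Q estimates for K arms with the incremental update Q <- Q + alpha * (R - Q)."""

    def __init__(self, n_arms, alpha=None):
        # alpha=None means "sample-average" mode: step size 1 / N(a)
        self.alpha = alpha
        self.q = np.zeros(n_arms)       # current estimate, one per arm
        self.counts = np.zeros(n_arms)  # number of pulls per arm

    def update(self, arm, reward):
        self.counts[arm] += 1
        # constant alpha -> geometrically recency-weighted average;
        # alpha = 1/N(a) -> plain empirical average of that arm's rewards
        step = self.alpha if self.alpha is not None else 1.0 / self.counts[arm]
        self.q[arm] += step * (reward - self.q[arm])

# Illustrative usage with made-up rewards:
est = BanditValueEstimator(n_arms=3)   # sample-average mode
est.update(arm=0, reward=1.0)
est.update(arm=0, reward=0.0)
print(est.q[0])                        # 0.5, the empirical mean of arm 0
```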
Okay, this is what we've seen operationally: there are regions in the space of estimated values where you get trapped, you never learn the value of the optimal policy, and therefore you cannot reach the optimal decision-making solution. Summarizing, this is the failure of greedy optimization. Greedy optimization means that at every step you pick the policy derived from Q as the argmax over actions of your current estimates; this is the greedy option, and it is not working.

Then we discussed, without any specific proof but with a sort of geometrical intuition, that one way out of this is to introduce some exploration. There are different ways of doing that, and we will review some of them. But the message that exploration is needed is actually backed up by a theorem. It's a result by Lai and Robbins, who provide a bound which says the following. Take any strategy that is "good", which has a technical meaning but in short means that asymptotically you always end up choosing the right arm, and you approach it at the best possible rate. Every such strategy must choose suboptimal options at least with frequency of order one over time, which means that the cumulative number of bad choices must grow like the logarithm of time.

This 1/t rate, if you think about it, is a relatively slow way of fading out exploration. If you reduce exploration faster than that, you are certain to be confronted with situations in which the sequence of events leads you to believe that a suboptimal arm was actually the best one, and from then on you stick to that arm. So you are not allowed to reduce exploration faster than 1/t. And if you go any slower than that, the price to pay is that you increase your regret, in the sense that you play suboptimal choices more frequently than you should.

One caveat is that there are prefactors here, and the prefactors depend on the properties of the system itself. The precise statement of the Lai-Robbins bound is that, as time T increases and approaches infinity, the expected number of times you pull a suboptimal arm (one that does not have the largest mean), divided by the logarithm of time, must be at least one over the Kullback-Leibler divergence between the reward distribution of that suboptimal arm and the distribution of the optimal arm: E[N_a(T)] / log T ≥ 1 / KL(p_a ‖ p*). This tells you that the value of the constant on the right-hand side is also very important: when you choose what constant to put in front of your exploration schedule, that constant should not be too small either, because it should be compared with this Kullback-Leibler divergence. So if you choose exactly the 1/t rate, you should also be careful about the constant, and the constant is connected to properties of your system which you don't know.
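(To get a feel for the size of that constant, here is a hedged numerical sketch for two Bernoulli arms; the means 0.5 and 0.6 are made-up illustration values, not numbers from the lecture.)

```python
import numpy as np

def kl_bernoulli(p, q):
    """KL divergence KL(Bern(p) || Bern(q)) in nats."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Hypothetical example: suboptimal arm has mean 0.5, optimal arm has mean 0.6.
p_subopt, p_opt = 0.5, 0.6
kl = kl_bernoulli(p_subopt, p_opt)

# Lai-Robbins: expected pulls of the suboptimal arm up to time T >= log(T) / KL, asymptotically.
for T in (1_000, 100_000):
    print(f"T = {T:>7}: at least ~{np.log(T) / kl:.0f} pulls of the worse arm")
```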
This limiting value of logarithmic regret, of a logarithmic number of bad actions, is however very fragile, so in practice you should stay a little bit away from it: as close as possible, but not too close, because otherwise you risk falling on the other side of the boundary. It's a fragile boundary to work on. Very good. So this means there is some inherent statistical reason why you cannot explore too little. It's not just that the algorithms aren't good enough: any algorithm that performs well must explore enough, and this statement is made quantitative by the bound.

All right. One way of implementing exploration in practice is epsilon-greedy. The epsilon-greedy strategy makes a very simple choice: the policy you derive at time t from your Q is given by the following rule, let me write it properly. You pick the action a at time t+1 as the argmax of your current estimates (for instance, you have the sample means for the different arms and you pick the arm with the largest sample mean), but only with some probability 1 − epsilon. You reserve a quota epsilon of your probability for exploration, and with that probability you pick any of the actions uniformly. In practice, you have your epsilon value and you draw a uniform random variable: if it is above epsilon, you pick the maximum and act greedily; if it is below epsilon, you pick any action uniformly among the K actions you have access to. Very easy to implement.

A few remarks. The obvious objection is that if epsilon is constant, exploration never fades, so this is not a good idea; it's a clear no, because in this case the number of times you choose suboptimal actions grows linearly with time, since you always have a fixed probability, no matter how small, of picking something which is not optimal. A better way is to schedule epsilon, so you have some epsilon as a function of t. According to the Lai-Robbins result, one good idea would be to choose something like epsilon_t proportional to one over t to the power 1 − sigma, with some exponent sigma larger than zero (unfortunately so many letters are taken, but let's call it sigma). You can make sigma as small as you like to get as close as possible to the limit; keep in mind that if you choose it exactly equal to zero, you have to pay attention to the constant you put on top. This is a reasonable choice and in practice it works pretty well.

It has one disadvantage: it doesn't make any difference between arms. You explore at the same rate, no matter which arm you're currently choosing. On the contrary, it would make more sense to select greedily the actions you have already visited many times, for which you are comparatively more certain of the values, and to explore more the arms for which you have less experience and therefore expect to be less confident about the values.
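(Before moving to the arm-dependent version, here is a minimal hedged sketch of the time-scheduled epsilon-greedy rule just described. The constants c and sigma, and the reading of the schedule as decaying slightly slower than 1/t, are illustrative assumptions, not values given in the lecture.)

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_schedule(t, c=1.0, sigma=0.1):
    # Decay a bit slower than 1/t; sigma -> 0 approaches the Lai-Robbins limit rate.
    return min(1.0, c / (t ** (1.0 - sigma)))

def epsilon_greedy_action(q_estimates, t):
    eps = epsilon_schedule(t)
    if rng.random() < eps:
        # explore: any of the K arms, uniformly at random
        return int(rng.integers(len(q_estimates)))
    # exploit: greedy arm under the current estimates
    return int(np.argmax(q_estimates))

# Hypothetical usage with made-up estimates:
q = np.array([0.2, 0.5, 0.4])
actions = [epsilon_greedy_action(q, t) for t in range(1, 11)]
```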
The way to do that is to use exploration rates which depend on time but also on the action you're currently picking. Let's think about how this works. You have several actions available; first you make the greedy choice based on your estimate, which gives you a tentative action, call it a-tilde. Then you decide how to explore, using an exploration rate attached to that action: if your greedy choice points to an action that has already been explored a lot, you go ahead with it, otherwise you explore. This allows you to keep more balance between different choices. In practice you do something very similar to before, except that you scale the exploration rate like one over the number of visits to that arm, raised to a similar exponent.

The bottom-line message is that there are several ways of doing exploration even within the simple epsilon-greedy scheme. When I write down expressions like this, I'm actually sweeping a lot of complexity under the rug, because you are choosing a functional form: what power, what function, should it be something that goes down like a power law from the beginning, or maybe something flat that then decays? There's a lot of craftsmanship in designing the way you explore. All these things are hyperparameters of our learning, which sometimes need to be carefully tuned. If you happen to have some practical experience with these algorithms, you may encounter a feeling of dismay at the beginning: the algorithm was supposed to work, supposed to converge, but in my hands it performs very poorly. One common reason is that this requires a lot of fiddling with the hyperparameters. There is always this little distance between the beauty of the theory and the weakness of the practice. If you happen to work with these things, don't fiddle randomly with parameters: think about what makes sense and what doesn't, why your exploration isn't working and what the issues could be. Thinking hard about how your hyperparameters should be adjusted saves you a lot of time in terms of practical convergence.

But this was a bit of a kitchen recipe, so let's move on, because before the break I also want to discuss other ways of doing exploration, beyond epsilon-greedy. Another way of doing exploration is to smooth out the operation of taking a maximum. Let me break it down, but first and foremost, one thing that is not entirely satisfactory even with the most refined choice above is that we are not treating different suboptimal actions differently. What do I mean? Suppose you have three actions available, not just two. One action will be the best one and the other two will be suboptimal. The epsilon-greedy rule as we've written it only considers the number of times you have visited those actions, which of course depends only implicitly on the values they have.
But maybe it's useful to make this dependence more explicit. Why don't we also let exploration depend, explicitly, on how suboptimal an action is? Yes, it's important to know how many times you visited an action; but if you've visited it many times and its estimate is abysmally low, maybe you should take that into account. Maybe it's not just bad luck that that particular action has a very poor estimate. So statistics matter, but maybe you're over-exploring, and in this specific sense you could do better: you could afford to be a little more greedy, maybe.

One way of approaching these different styles of exploration goes like this; I'll give you the basic conceptual steps. Let's start over from greedy. Greedy means that you are taking the maximum over all possible policies (for bandits) of the sum over actions of the policy probability times your current estimate. This operation amounts to picking the action with the largest Q. It's a simple linear maximization problem: you have a vector, hat Q, your estimate, which points somewhere, and you have another vector, the policy, whose components are non-negative and sum to one; the best thing you can do is align your policy with the direction of the largest component. This is absolutely equivalent to taking the argmax, exactly the same thing, as you can easily check.

How can we soften this requirement? One possibility is what is called an entropic regularization. We replace the greedy strategy with this other prescription: we still maximize over policies the same quantity, but now we add a bonus for probabilities that are not too focused on a single action. The bonus is a constant, in which a parameter beta appears as an inverse, with beta strictly larger than zero, times the entropy of the policy: maximize Σ_a π(a) Q̂(a) + (1/β) H(π), where H is the Shannon entropy, H(π) = −Σ_a π(a) log π(a). Shannon entropy should be familiar to most of you, I think, if you've had a little bit of information theory. The entropy is always non-negative, and it is zero only if the policy is deterministic, that is, only if it is concentrated on a single action. Its maximum is the logarithm of the number of actions (K, the number of actions for us, though I may not have used that letter before, so let's just say the number of actions), and the entropy is maximal only when the policy is uniform, π(a) = 1 over the number of actions. These are basic properties of the entropy.

Clearly, if we add this term, it is really an exploration bonus, because to maximize the second term, the entropy, you want your probability to be as spread out as possible, whereas the first term tells you to concentrate as much as possible on the largest value of your Q vector. So here there is a trade-off, a balance between trying to maximize your value and trying to balance this with entropy. For those of you with a background in physics, this is very much what happens when you balance energy and entropy in a free energy. So there's a connection with physics as well, but it doesn't matter.
The bottom line is that if you perform this maximization, which is a very straightforward analytic procedure, you can find an explicit expression for the policy. The policy that comes out, call it π_B, with a subscript B which will become clear in a second, is the exponential of beta times your estimate, divided by the normalization: π_B(a) = exp(β Q̂(a)) / Σ_b exp(β Q̂(b)), where the sum runs over all possible actions. The B stands for Boltzmann, because this is connected to the Boltzmann distribution in statistical physics, which is the same thing except for a minus sign, since in physics you minimize energies while in machine learning you maximize rewards. Apart from that, that's why it's called Boltzmann, and this choice is also called Boltzmann exploration.

Think of this beta as a sort of inverse temperature: your system wants to align with a certain vector, but there is some noise, in the form of an effective temperature, that makes it wander around a little. It's a very loose physical interpretation of what this beta parameter means. More seriously, it's a trade-off between the competing requirements of being focused on exploitation and keeping some exploration: it's a way to make the balance between exploitation and exploration explicit. Beware, there is nothing fundamental in this. It's just one intuitive way of writing down this balance; there is no underlying law of nature by which this should be the right way to go. It's one way. And, as a matter of fact, even though this approach is widely used, it is unsatisfactory in many respects.

Let's go one step at a time. First of all, clearly, if beta is constant you're not going very far from the viewpoint of exploration, because there will always be exploration. Similarly to the case of constant epsilon, this is not the solution of your exploration problem, because you keep on exploring. You do explore in a way that differs between arms: at variance with constant epsilon-greedy, which randomizes blindly, here you randomize selectively, because your policy depends on your current estimates, so you explore less than with constant epsilon-greedy, especially for the very suboptimal arms. But you keep on exploring. Beta constant, as always, is not enough.

So what if we schedule this inverse temperature? It requires a bit of calculation, but you get to the rather intuitive result that the inverse temperature should grow as the logarithm of time. Remember, if you want to recover the maximum, you send beta to infinity. This is obvious from the regularized objective, because when beta tends to infinity the entropy term disappears and you're back to the argmax; but it's also obvious when you look at the solution, because if you send beta to infinity there will be just one term that dominates the sum, the one with the largest estimate of Q. Beta tending to infinity means the temperature going to zero, if you wish. So in a sense you are cooling down your system very slowly, which is a procedure known in physics, in statistical physics, as annealing.
It's something you actually do with materials, cooling them down very slowly to prevent them from becoming amorphous. If you have a mixture of materials and you cool it down too rapidly, the components may separate; so when you produce a material and you want it very homogeneous, you may have to heat it up to mix it and then cool it down very slowly to prevent separation. This is, again very loosely, similar to what happens here. And this logarithm of t is something you can also derive by arguments from statistical physics, in case you're familiar with that; if you're not, it doesn't really matter. This log t schedule is the counterpart, if you wish, of the 1/t choice for epsilon-greedy.

Now, again, this is also not yet a good choice, because this beta parameter doesn't know which actions have been visited more or less. We're still exploring in a way that does not care about the uncertainty of the estimates. So we have to put into this scheme something that takes care of the actions, and the way you do it is to introduce temperatures for each arm. This breaks the connection with entropy that we had before, because the entropic derivation requires beta to be the same for all actions; so now we make a leap and say, okay, let's improve on this by introducing action-dependent temperatures, so to speak. One can then work out the best way of choosing them, or at least a way that gives you theoretical guarantees. I'm not giving you all the details, but I'm pointing you to a very useful paper called "Boltzmann exploration done right", with Cesa-Bianchi among the authors; you can easily find it in the preprint repositories, it's a recent paper, from 2017.

One thing that is interesting in this approach, regardless of how you choose these parameters, is that you choose your arms according to your estimates by a rule like this, which you can see as a sort of tweaked softmax. And one interesting mathematical result is that you can turn this softmax into a sharp max by a trick, the Gumbel trick, which doesn't require a lot of mathematical machinery. It tells you that the action you pick according to the distribution above is the same, in distribution, as the argmax over actions a of β_t(a) Q_t(a) + G_a, where the G_a are independent Gumbel random variables. You might not know the Gumbel variable: its probability density is f(g) = exp(−g − e^(−g)). It's a somewhat odd-looking distribution, but it has a very simple, nice shape, peaked around zero. The nice thing is not so much the Gumbel variable itself, which is interesting in its own right because it's related to extreme value distributions; the interesting point here is that, using this trick, you can turn your exploration problem, stated by softening the maximum as we did before, into a sharp maximum, where the exploration bonus now sits inside the argmax.
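(Both ideas are easy to check numerically. Here is a small hedged sketch: a softmax, Boltzmann, action sampler and an empirical check that adding Gumbel noise inside an argmax reproduces the same action distribution. The Q values and beta are made-up illustration numbers.)

```python
import numpy as np

rng = np.random.default_rng(0)

def boltzmann_probs(q, beta):
    """pi_B(a) = exp(beta * q[a]) / sum_b exp(beta * q[b])."""
    z = beta * (q - q.max())          # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def gumbel_argmax(q, beta):
    """Sample an action as argmax_a of beta*q[a] + G_a, with G_a ~ Gumbel(0, 1)."""
    g = rng.gumbel(size=q.shape)
    return int(np.argmax(beta * q + g))

# Made-up estimates and inverse temperature:
q = np.array([1.0, 1.2, 0.5])
beta = 2.0

print("softmax probabilities:   ", boltzmann_probs(q, beta))

# Empirical frequencies from the Gumbel trick should match the softmax probabilities.
samples = np.array([gumbel_argmax(q, beta) for _ in range(20_000)])
print("Gumbel-trick frequencies:", np.bincount(samples, minlength=len(q)) / len(samples))
```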
The take-home message is that you can do two different things: either you choose the argmax and then soften it, as before, or you modify your estimates and then take a sharp argmax on the modified estimates. Why am I insisting on this? Because the approach in which you modify your estimates connects with a class of algorithms called upper confidence bound (UCB) algorithms, which express the following idea. In short, you take your current estimates for the actions, which are basically your empirical averages, you add a bonus to each of them, and then you take the argmax. The bonus, random or not, tends to promote the choice of the actions you have visited the least; in the classic UCB1 algorithm, for instance, it grows slowly with the total time and shrinks with the number of visits to that arm, like the square root of 2 ln t over N_t(a). If you adopt this correction, with this exploration bonus, then UCB algorithms are provably near-optimal, in the sense that they come very close to the Lai-Robbins bound.

This is important. Just as a side note, there are very many ways of mixing exploration into your greedy policy, but they fall into two big classes: one big class is to perturb your estimates, which is the second option above, and the other is to soften your argmax. Fine. I think we've now made a tour of many different choices, so if you happen to encounter them in the literature they shouldn't be completely obscure to you. And this is definitely a good point to stop. In the second half we will go back to our original problem, which is to use this exploration, together with learning, to achieve optimal control of an MDP without a model. Any questions so far? No, you look exhausted.

"May I ask a question?" Yes, please. "I didn't understand what the link is between the entropic regularization and the Boltzmann exploration. Are they the same thing?" Yes, this one and the previous one: they are exactly the same thing, in the sense that the result of that optimization, its solution, is this policy. You can work out the calculation: write down the regularized objective explicitly, take the derivatives, and maximize over the policies respecting the fact that they have to be normalized. If you do this simple exercise of finding the maximum over policies, that's the result you get. So they are the same thing. "Okay, thanks." Sure. Any other questions? Okay, so let's take a break and we reconvene shortly.

So, we're ready to discuss, in general, the algorithms for reinforcement learning in the model-free case. Like I said, we want to control our system, in general a generic Markov decision process. What we're going to do now is write a sort of master code, and then we will see how we can implement the various modules of this master algorithm in different ways, what choices we might make, and what the differences and advantages are. The basic structure of the algorithm works like this. Initially, you have to initialize your parameters.
So you define the state space, you define your learning rates and your initial exploration rates; that's your setup. Then you initialize your estimate of the Q function. This is an estimate; it could carry a hat on top, but let's not add it, in order not to overburden the notation. This is your initial guess for the Q function.

Then you have a loop, which does the following. You may want to break your learning into episodes: you run your updates for some time and then restart from somewhere. This is natural when you have a goal task, because the episode ends when you reach the goal, and then you restart; but even if the episode doesn't end on its own, at some point you may say, okay, my value function is not improving any longer, so I restart. To include this possibility of multiple restarts, we have a loop over episodes. At the start of every episode we have to start from somewhere, so the first thing we do is initialize a state: we pick a state according to some initial distribution of our choice, which should be uniform enough to allow the system to visit different states.

Once we have that, we derive a policy π from Q. What does that mean? This is the part where we have to balance exploitation and exploration, for instance with some epsilon-greedy rule, a softmax, or your favorite exploration scheme. With that, we choose an action A according to the policy π. Then we take action A and observe the reward and the new state.

Once we have that, two modules enter, if you wish. The first is: compute the temporal difference error δ. The second is: compute the eligibility trace E, which is a matrix, because it carries an eligibility for states and for actions, since we're working with state-action values. We will make everything explicit in a moment, but these are two functions whose content can differ depending on the specific algorithm; that's where the algorithm dependence sits. Today we will mostly discuss what to do about the temporal difference error and keep the eligibility trace simple, but you can combine the two: they are basically two independent dials, and you can have eligibility traces such as TD(λ) or other variations on the theme. It's a very rich field in itself, well covered by the book of Sutton and Barto, as well as in this course; but in the interest of time we will focus specifically on computing the temporal difference error.

Once this is done, we update our Q matrix. The arrow symbol is often used to mean that you are updating: the new Q comes from the old Q, plus some learning rate, which can depend on time, on the action, and so on (let me keep it symbolic), times the temporal difference error, which is a scalar, a real number, times the eligibility trace, which in general is a matrix: Q ← Q + α δ E. So Q is a state-action matrix, E is a state-action matrix, and δ is a real number which depends, of course, on Q implicitly.
The eligibility trace depends on the past history of actions and states that have been visited. Fine. Then what we do, basically, is rename the new state as the current state, and the inner loop ends: we have produced a new state and we can restart from it. That inner loop is one episode, and we may have many of them by re-initializing S at random and repeating the loop.

This pseudo-algorithm, under appropriate choices of the algorithm-dependent functions, which we will discuss in a second, converges with probability one to the optimal solution of the Bellman equation, given proper scheduling of exploration and learning. For the learning rates you know how they should look, according to the sufficient conditions (the Robbins-Monro conditions); for the exploration rates you know that they have to be chosen carefully, and for epsilon-greedy a good choice basically decays like the inverse of the number of times you have visited that state-action pair.

Good. Within this sort of master algorithm we can then specify different choices for the computation of the temporal difference error, and also for the eligibility traces. But let me put a little break here and say that in what follows we take the simplest eligibility trace: it is one for the state I have just visited and the action I have just taken, and zero everywhere else. This corresponds to the algorithm TD(0), which means that all the error I observe is credited to the last state-action pair. So there is only one entry of my Q matrix that changes at each time step, the entry corresponding to the state-action pair I have just visited. This is of course a limitation, but it allows us to focus on the interesting part, which is how to compute the temporal difference error; in the book by Sutton and Barto you can see all these things combined in different flavors.

Now, when it comes to the temporal difference error, we will review three possible choices, which cover a lot of the work that has been done in the last thirty years or so. The first is an algorithm called SARSA. SARSA is a fairly literal acronym for state-action-reward-state-action, because the way the temporal difference error is constructed requires the evaluation of one additional state and action with respect to the usual stream. To be explicit, in SARSA the recipe for the temporal difference error is: the reward, plus gamma times my current estimate at the new state and the new action, minus my current estimate at the current state and action, δ = R + γ Q(S', A') − Q(S, A). This closely resembles the recursion formula for the Q value: you remember that the expectation of this delta reproduces exactly the recursion relationship, so it is a very natural choice from the viewpoint of stochastic approximation. One key aspect is that you have to choose in advance the action A' that you will use at the next step, so this requires a double extraction of actions; those are the two A's in SARSA.
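(Here is a minimal sketch of the master loop specialized to TD(0) with the SARSA error just written. The environment interface, env.reset() and env.step(a) returning a reward, the next state, and a done flag, and the epsilon-greedy helper are assumptions for illustration, not something specified in the lecture.)

```python
import numpy as np

def epsilon_greedy(Q, s, eps, rng):
    # Assumed exploration rule: uniform with probability eps, greedy otherwise.
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def sarsa_td0(env, n_states, n_actions, episodes, alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    """Tabular TD(0) control with the SARSA temporal difference error."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))            # initial guess for Q
    for _ in range(episodes):
        s = env.reset()                            # initialize a state
        a = epsilon_greedy(Q, s, eps, rng)         # derive policy from Q, pick A
        done = False
        while not done:
            r, s_next, done = env.step(a)          # observe reward and new state
            a_next = epsilon_greedy(Q, s_next, eps, rng)   # pick A' with the same policy
            # SARSA TD error: delta = R + gamma * Q(S', A') - Q(S, A)
            delta = r + (0.0 if done else gamma * Q[s_next, a_next]) - Q[s, a]
            Q[s, a] += alpha * delta               # TD(0): credit only the last (S, A)
            s, a = s_next, a_next                  # shift the stream and continue
    return Q
```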
In SARSA you pick that second action using, again, the policy derived from Q: A' is drawn from the same policy. So you use this policy twice in SARSA: once when you choose the first action, and once inside the temporal difference error, to draw a new action A' according to the same policy you are currently using, in order to construct your TD error. And then the loop continues. That is one possible choice.

Now we'll see another possible choice and highlight the similarities and differences. The second choice is called Q-learning, and the temporal difference error in Q-learning is defined as follows: δ = R + γ max_a Q(S', a) − Q(S, A). Notice the subtle difference between the two. It lies in how you define the Q value that you compare against for the next step: here you take the maximum of your Q function. You are using a greedy choice just to evaluate the target, not as the behavior policy; the policy is still epsilon-greedy, but you use this maximum to evaluate a term in the temporal difference error. This makes it more similar to what you have in the Bellman optimality equation: SARSA is closer to the recursion equation for the value of the policy, Q-learning is closer to Bellman's equation, in that you take a maximum of your estimate.

There are advantages and risks. The advantage is that Q-learning is more aggressive: you move closer to your best policy. But being more aggressive, in a sense it samples less inside this term; the word exploration is a bit of a misnomer here, but you are not given the possibility of sampling a random action A' for the target. In SARSA this term fluctuates because of the different A' you might draw; in Q-learning it does not. The good news is that both algorithms provably converge to the optimum: schedule your learning rates and exploration rates appropriately and both of them converge. But their finite-time performance can be very different; before convergence they may do very different things. In Sutton and Barto there is an interesting example that illustrates the difference between SARSA and Q-learning in finite time; let's discuss it quickly, and you can find all the details in the book.

"Excuse me, can I ask a question? Just a small thing: I'm getting confused between the uppercase A prime and the lowercase a prime." Yes, thank you. The uppercase letters here are the states and actions that you actually observe or choose along your trajectory, whereas the lowercase letters are just indices of the matrix. It is a formal operation over indices: you take your Q matrix, look only at the row corresponding to the S' you have just visited, scan all the columns, and pick the maximum. The capital letters mean you are focusing on that particular entry along the realized sequence. Is that clear to everyone?

The example I was about to discuss is called cliff walking. In a nutshell, it's a gridworld problem in which you are interested in going from a start point here to a goal point there, and along the way there is, basically, the edge of a cliff.
Basically, if you step off the cliff, you get a very, very negative reward: you crash down, you will be badly hurt, and it will take you a long time to recover. At the same time, you pay a small penalty for every step you take on the ordinary part of the grid, which is just the cost of time going by. So the basic idea is that you want to reach the goal along the shortest path, which runs right along the edge, because you want to get there in the shortest possible time; but the cliff is there. It's a bit as if you were blindfolded: at the beginning you don't see the cliff, you just have a blank map where you can mark down the points you have visited. As usual in model-free learning, you don't have a map of the system; you have to learn from experience what the best thing to do is.

When you run SARSA on this problem, again with a fixed exploration rate, so we are talking about the finite-time behavior of SARSA, not the asymptotic one, which is optimal for both, what SARSA does is take a safer path, further from the edge. Why does it do that? Because of the second action that is sampled: SARSA looks one action ahead, and if that sampled action steps off the cliff, it learns that this is going to be very bad, which pushes it towards the safer path. Q-learning, instead, learns a path much closer to the cliff. Q-learning learns a policy that is closer to optimal, but its performance in terms of collected reward is lower, because it falls off the cliff much more often, since there is still randomness in the actions it takes. So you see that these two algorithms, even though they have the same asymptotic guarantees, can display quite different behavior in practice after a finite number of episodes, and this depends entirely on the only difference between them: how you estimate your temporal difference error. That is the first remark.

The second remark about the difference between the two is this. SARSA, you see, always uses the policy π, both for choosing actions and for estimating: the new action in the target is drawn from the same policy as in the core algorithm. Q-learning uses the policy to choose actions, but uses the greedy rule to estimate, and greedy can be thought of as just another policy. This is why we usually say that SARSA does on-policy estimation and Q-learning does off-policy estimation. This is a degree of freedom you can use, and sometimes it is useful to use off-policy estimation: you act according to one policy, but you want to evaluate your future according to another policy, which maybe works better. In applications from computer science this is absolutely legitimate. If you think instead about the neural interpretation of reinforcement learning, how it might work in the brains of decision-making animals, it makes a little less sense, because you cannot really disentangle the policy you act with from the one you estimate with; it's less intuitive.
But both things exist, and this distinction is very important because, as we will see not tomorrow but next week, problems arise when you combine these approaches with function approximation. Right now we are working in the tabular setting, with states and actions of a small Markov decision process, but the next step will be to use approximations in order to deal with vast state-action spaces. And when you combine temporal difference learning, function approximation, and off-policy estimation, disaster can ensue: off-policy algorithms tend to have trouble when mixed with TD methods and function approximation. Nothing to be worried about at this stage, since we discuss it next time, but this is why I'm already introducing the notions of on-policy and off-policy.

One more thing. We can make another choice, inspired by SARSA, which is the following. Do we really need to take another sample A' according to the policy π, or can we do something else, maybe something smarter? That possibility is what is called expected SARSA. It is very similar, but now we replace that random estimate with the average with respect to the policy. Explicitly, in expected SARSA the temporal difference error is the reward plus gamma times the sum over actions a' of π(a'|S') Q(S', a'), minus the current estimate: δ = R + γ Σ_{a'} π(a'|S') Q(S', a') − Q(S, A). You see, the difference is in the middle term, which is no longer a sample according to the policy but an average over the policy. This is an operation you can do because the policy is in your hands: it is the object you derived from your Q, maybe an epsilon-greedy rule, so you can use its explicit expression to combine the entries of your Q matrix. The advantage is that you no longer have fluctuations from drawing A', so you reduce the noise in the estimated temporal difference error. When you check the performance (I'm not sure, but maybe tomorrow there will also be an example with expected SARSA, I hope so), you will see that it improves with respect to SARSA. And this comes with no disadvantage, because it is still on-policy: you are always using the same policy π to estimate your temporal difference error. So, fewer fluctuations, still on-policy.

In fact, there is a sort of overarching scheme which combines expected SARSA and Q-learning, if you wish, and gives another view on Q-learning. You can write your delta as R plus gamma times the sum over a' of some other policy π̃(a'|S') times Q(S', a'), minus Q(S, A). This expression is the same as the one for expected SARSA, except that now I'm using another policy: not the same policy derived from Q in the previous step, but another one, which could be anything. If π̃ is the greedy policy from Q, the argmax of Q, then this is equivalent to Q-learning. If π̃ equals π, then you are on-policy, everything is fine, and this is expected SARSA. But if you vary this estimation policy, you have a continuum of choices that go from π to the greedy choice, and the greedy choice corresponds to pure Q-learning. So there is a whole family of algorithms you can choose from, which interpolate between expected SARSA and Q-learning, between on-policy and off-policy.
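(To keep the three recipes side by side, here is a small hedged sketch of the corresponding TD errors as plain functions of a tabular Q; the epsilon-greedy probabilities used for the expected-SARSA term are one illustrative choice of the policy, not the only possible one.)

```python
import numpy as np

def td_error_sarsa(Q, s, a, r, s_next, a_next, gamma):
    # SARSA: the target uses the sampled next action A'
    return r + gamma * Q[s_next, a_next] - Q[s, a]

def td_error_q_learning(Q, s, a, r, s_next, gamma):
    # Q-learning: the target uses the greedy (max) value at S'
    return r + gamma * np.max(Q[s_next]) - Q[s, a]

def td_error_expected_sarsa(Q, s, a, r, s_next, gamma, eps):
    # Expected SARSA: the target averages Q(S', .) under the current policy.
    # The policy is assumed here to be epsilon-greedy with parameter eps.
    n_actions = Q.shape[1]
    probs = np.full(n_actions, eps / n_actions)
    probs[np.argmax(Q[s_next])] += 1.0 - eps
    return r + gamma * probs @ Q[s_next] - Q[s, a]

# Setting eps = 0 in the expected-SARSA policy recovers the Q-learning target,
# one point on the interpolation between on-policy and off-policy mentioned above.
```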
Okay, so all these three things are closely related to each other, as they should be, because all these choices of the temporal difference error must eventually converge to the Bellman equation, so it's clear that they cannot be totally independent of one another. But this gives us a unified view of rather different algorithms that have been proposed, and of the relationships between them.

Clearly this was just a quick overview of the most important algorithmic approaches to temporal difference learning. Each of them may perform better or worse depending on the case at hand and on the choice of hyperparameters, and there is a whole set of competing algorithms which are slight variations on the theme. We don't discuss them at all, because that is not the purpose of this course; the purpose is to give you some general, broad ideas from a high-level viewpoint. But if you happen to open a paper on reinforcement learning, you should be able to orient yourself within this class of algorithms, which of course does not cover all possible classes: we will discuss very different architectures in future lectures, which connect to these ones only in part. For this class of algorithms, though, the basic concepts have been laid out in this lecture.

I think we're done for today. Tomorrow we will have tutorials on this, with exercises, especially on SARSA and Q-learning, and hopefully also expected SARSA. With that, I'm done; are there any questions?

"Sorry, can I ask a question? Can we say again what we mean when we talk about episodes?" Okay, I didn't discuss this much, so let's consider, for instance, the problem of cliff walking. How do you learn the optimal policy in this particular problem? Look back at the original pseudocode. The first step is that you choose some initial state, which in this case will be the start state. Then you begin the loop with your guess for the Q function, which may be flat, all zeros. You take some first step, perhaps at random: you pick an action, you observe some reward, because there will be some cost, whether you walk off the cliff or stay on, and then you improve your Q, and so on. Eventually, since the domain is finite, one way or another, by chance or not, perhaps falling off the cliff and collecting many penalties along the way, you end up in the goal G. When you are there, your episode finishes, because you have reached the target. Then you restart again from the start state, but keeping in memory all the experience accumulated in your Q matrix: you start a new trajectory, but with the Q matrix you have learned so far. You repeat this many times, and each stream of data that brings you from the starting point to the goal point is one episode. Is that any clearer? "Yes, thanks." Okay. If not, well, have a nice day and see you tomorrow. Bye bye. Thank you. Goodbye.