All right, so for today, as promised, we're going to move from learning the value of a policy to learning the optimal policy. We'll do this in two steps. Today we set up the stage and discuss in detail the case of K-armed bandits; most of what we discuss will actually be on two-armed bandits. Tomorrow we will introduce the general algorithm for state-dependent situations, ramping up from bandits to contextual bandits to general MDPs. That's the plan for today and tomorrow, and on Friday we will have, as usual, the tutorial on this part, that is, how to learn to control a system optimally without a model. Remember, we are always in the model-free setting now. Let's briefly recap what we had from previous lectures. We focused on the question of how, given a certain policy chosen by the agent, we can learn its value function. This, I remind you, is the expectation of the discounted sum of rewards, which we can write in this way given that the initial state is s. Hidden in this expectation is the fact that it runs over a stream of data: there is an initial state; from that initial state an action A_0 is taken according to our chosen policy; as a result a reward is obtained and a new state is produced, and this is due to the dynamics underlying our system, the transition probability, which, as you remember, is unknown to the agent. The agent just observes this sequence of states, actions, rewards, and new states, produced sequentially by interacting with the environment. What we have shown in some detail is that it is possible to obtain an algorithm which, just based on sequential reads of this stream of observations, is able to construct this vector of values: we obtain approximations of the vector V that converge with probability one to the true value of the given policy. The way temporal difference learning works in general is the following pseudocode. Initialize your estimate with some vector; of course, the closer it is to the true value of the policy, the better. Then you loop over time steps. More precisely, the loop starts like this: you observe your initial state S_0, then you pick an action according to your policy; as a result, you observe a reward and the new state. Then you define the temporal difference error, delta at time t+1, as the observed reward plus gamma times your previous estimate at the newly observed state, minus your previous estimate at the current state.
And you remember that this temporal difference error can be interpreted as the difference between the observed reward and the estimated reward: at time t we can construct an estimate of what we expect the reward of the next step to be, using our current approximation of the value function, and we compare it with the actual observation. This gives us the temporal difference error, which we use to correct our previous estimate. Then, in general, you update an eligibility trace. Initially you have also defined some eligibility vector, and at each step you update it at time t by some rule; for instance, just to clarify the idea, e_t(s) equal to the indicator function of the state you have just visited, which is what you do for TD(0). So either you update it with memory, which is the case in TD(lambda), or you just define it as the indicator function of the current state if you do TD(0). Here you fill in your own favourite choice of eligibility trace, which might be one of those you saw in the tutorial, or one of the many other choices you can make. Then it is time to update your estimate: your new estimate at any state s is the previous estimate, plus the learning rate, which may depend on time if you schedule it properly, times the error, times the eligibility of that state at that time. And this loop runs until a termination condition, which could be that the difference between the previous estimate and the current estimate is below some tolerance defined according to some norm, or whatever criterion you prefer. This algorithm, under the Robbins-Monro conditions, which I recall say that the series of the squares of the learning rates must converge while the series of the learning rates must diverge, produces an estimate that tends to the true V^pi as t goes to infinity. The convergence rate depends on the choice of the learning rates; we don't discuss it, as it is not particularly complex or informative at this stage. So this is, in a nutshell, what temporal difference learning is, and today we would like to combine it with the notion of optimization in order to solve Bellman's optimality equation asymptotically, in much the same spirit: relying only on a sequence of observations obtained by interacting with the environment, like the stream here, we would like to update our estimate of the value and concurrently update our policy, that is, change the policy at the same time as we learn the value of the current policy. This is going to be some sort of running on two different rails while trying to keep them consistent. But before we go there, let's take a very, very simple example of how temporal difference learning works, even simpler than the one you discussed in the tutorial. This example serves two purposes. The first is to demystify the temporal difference setting a little, by making it extremely transparent in a very simple case. Of course this doesn't mean that temporal difference learning is a trivial concept; it is a concept which, however, connects very clearly with some very simple ideas.
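As a rough illustration of the loop just described, here is a minimal Python sketch of TD(lambda) policy evaluation. The `env.reset()` / `env.step(action)` interface and the `policy(state)` callable are hypothetical stand-ins introduced only for this example, and the parameter values are arbitrary; this is a sketch of the pseudocode above, not a definitive implementation.

```python
import numpy as np

def td_lambda(env, policy, n_states, gamma=0.9, lam=0.8, alpha=0.1, n_steps=10_000):
    """TD(lambda) policy evaluation: estimate V^pi from a stream of interactions.

    `env` is assumed to expose reset() -> state and step(action) -> (reward, next_state);
    `policy(state)` returns an action. Both are hypothetical interfaces for illustration.
    """
    V = np.zeros(n_states)        # initial guess V_0 (any vector would do)
    e = np.zeros(n_states)        # eligibility trace
    s = env.reset()               # observe the initial state S_0
    for t in range(n_steps):
        a = policy(s)                          # pick A_t according to the fixed policy
        r, s_next = env.step(a)                # observe R_{t+1} and S_{t+1}
        delta = r + gamma * V[s_next] - V[s]   # temporal difference error
        e *= gamma * lam                       # decay all traces...
        e[s] += 1.0                            # ...and give credit to the visited state
        V += alpha * delta * e                 # update every state, weighted by eligibility
        s = s_next
    return V
```

With lam set to zero the trace reduces to the indicator of the current state, which is the TD(0) case mentioned above.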
That was the first purpose; the second one is to refresh our ideas about bandits, which we will be using subsequently. So the example we discuss here is: what is the value of a policy for a K-armed bandit? A K-armed bandit is a very simple object from the viewpoint of MDPs: there is just one state. What is in that state? Well, a state for a K-armed bandit means that we have K arms to choose from, numbered one to K, and each of these arms has some distribution of rewards. For instance, for arm one the rewards might be distributed like this: this is the probability density of the reward for arm one, and similarly there could be different distributions for the other arms. I'm drawing them in these deliberately weird ways to convey the idea that they could be anything, totally arbitrary distributions, up to some well-behavedness conditions. When you study algorithms for bandits you often find the requirement that the distributions be bounded, in the sense that there is a minimal and a maximal reward, or that they be sub-Gaussian, which is a statement about the tails in case they are not bounded. You will find these kinds of assumptions many times. It is possible to discuss bandits with broader distributions, like power-law tails, so you might see many of those in the literature if you happen to be interested in this kind of problem. There are also many test beds which are much simpler, like Bernoulli bandits or Gaussian bandits. So they could be of any kind, and your lack of knowledge about the distributions could range from "I don't know anything except that they have some nice boundedness property" to "I know that these are all Gaussians of the same variance, but I don't know the means." All of this defines the one state: the state here is how we parameterize the distributions of the various arms. Given that state, there are K actions we can take, one to K, and after each of these actions we return to the same state. This means that the distributions don't change over time and don't change as a result of the decisions I make, which means that in practice we are dealing with the specific situation that goes by the name of stochastic bandits. As a result of taking an action and getting back to the same state, you get your reward, which is drawn from that arm's particular distribution: if I pick action one, I draw from this distribution here, and so on for the other decisions. Now suppose we fix the policy. What does that mean? Just to fix ideas, choose a random policy: we pick actions at random at every step, we roll a die with K faces, choose which action to take, and observe the reward it gives. Our question is: what is the value of this policy, and how do I compute it with temporal difference learning? In this case, what we observe is not really a sequence of states, because the state is always the same, so what we observe at the level of states is irrelevant; we just observe a sequence of pulls: we pull an arm, we get a reward, we pull another arm, we get a second reward, and so on.
There is a slight misalignment between the notation for MDPs and the notation for bandits. In temporal difference learning one usually defines the reward as R_{t+1}, to convey the idea that in general it depends on both the present state and the next state; at least this is the notation that Sutton and Barto use and promote. For bandits one usually uses another convention: if you pick an action A_t you observe a reward with the same time label as the action, so the stream is usually written as A_0, R_0, A_1, R_1, and so on. This might be a small source of confusion if you jump from one notation to the other, but a little reflection should resolve it. It is also my disclaimer that if I mess up the notation I can blame someone else. So, you observe this stream and you want to obtain the value of the policy that is generating these actions. Now, the value of a policy for a stochastic bandit is a very simple object, because since there is a single state, V^pi is a scalar, just a real number, so we don't need to deal with a vector. Also, since there is a single state, there is only one meaningful choice for the eligibility, which is to set it equal to one: we cannot give credit to other states, because there are none, and there is no reason to give different credit at different times, because the whole process is inherently stationary. And then, as you may recall from our previous discussions, gamma doesn't play any role. The reason is that at every time we pick an action independently of the past and observe a reward independently of the past, so for a given gamma the only thing that changes is an overall factor 1/(1-gamma) multiplying the value. To see this, let me open a parenthesis. You may remember the recursion relation for the value of a policy, the object you have seen several times so far. For a bandit there is no state, so the value is just a scalar and all the state indices drop out; the transition probability is the indicator function of the single state, which stays the same, and the policy doesn't depend on the state, so we can write it simply as pi(a). Using the recursion relation, the value equals the sum over a of pi(a) times the mean reward, which depends only on the action, plus gamma times V^pi; I'm just rewriting the recursion for a bandit. Now, if you open up the sum, the first term is the pi-weighted average of the mean rewards, and in the second term V^pi doesn't depend on the action, so it comes out of the sum, and the remaining sum over the policy probabilities is equal to one, because the policy is normalized. So I can take that term to the left-hand side, and I get that V^pi equals 1/(1-gamma) times the sum over a of pi(a) times the mean reward of arm a. So whatever gamma I choose, I always get a value proportional to the value at gamma equal to zero, and therefore it doesn't really matter to have gamma different from zero for bandits.
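Written out, the little derivation just sketched, using r(a) for the mean reward of arm a, is:

```latex
V^{\pi} \;=\; \sum_{a} \pi(a)\,\bigl[\, r(a) + \gamma\, V^{\pi} \,\bigr]
        \;=\; \sum_{a} \pi(a)\, r(a) \;+\; \gamma\, V^{\pi} \underbrace{\sum_{a}\pi(a)}_{=\,1}
\quad\Longrightarrow\quad
V^{\pi} \;=\; \frac{1}{1-\gamma}\,\sum_{a} \pi(a)\, r(a).
```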
And, as I told you, this is just because you are basically repeating the same thing over and over again; there is no real notion of a future here. So from now on we can stick to gamma equal to zero, it doesn't make any difference. These simplifications are particularly useful because they reduce temporal difference learning to a very, very simple algorithm. For a bandit it doesn't matter whether we talk about TD(0) or TD(lambda), because the trace can only be one, so there is no difference. The update just says that your estimate at the next time is the previous estimate, and these are now scalars, just numbers, plus your learning rate alpha_t times your reward R_t minus the current estimate. That is what the temporal difference error becomes for bandits with gamma equal to zero: just the difference between the reward and the current estimate. Now we can unpack this recursion further to make it even more transparent and see what is happening. To do that, let's consider two situations. The first case is when alpha_t is a constant, the same at every step. We know this is not an ideal choice, because it is not going to give us convergence to the true value function, at least in the sense that the sufficient conditions are not satisfied, so we cannot rely on the Robbins-Monro result to prove convergence to the actual value function; that is the more precise statement. So let's consider this situation and do it in a very pedantic way. Let's start with a guess and say that we choose zero as our initial value V_0; this is just meant to simplify the calculations, but you could choose anything. In fact, you don't know what the distribution of your rewards is: it could be in the thousands, it could be in the minus thousands, it could be around zero, you don't know. So you start with your zero guess, and then at the first step, remember, you have your stream: you pick an action A_0, and this gives you a reward R_0. Then you update according to the rule with constant alpha and you say: V_1 is V_0 plus alpha times (R_0 minus V_0), but both of those V_0 terms are zero, so at my first step I just get alpha times the reward I just measured. So far so good. Then I go to the second step, which tells me that V_2 is V_1 plus alpha times (R_1 minus V_1). I just plug in the expression from the previous step, and this becomes alpha R_1 plus (1 minus alpha) times alpha R_0, which I can write as alpha R_1 plus alpha (1 minus alpha) R_0, just to be explicit. You start seeing the pattern, I hope. Then for the next step I have to use V_2 in the same way; let me rewrite this in a way that makes it even more transparent: I collect the estimate V_2 from the previous line, extract the factor (1 minus alpha), and plug it in. What do I get? I get alpha R_2 plus alpha (1 minus alpha) R_1 plus alpha (1 minus alpha) squared R_0. And then you can proceed inductively, or just jump ahead and say that after T steps your estimate will be as follows.
So the estimate is going to be V_T = alpha times the sum over t' from 0 to T-1 of (1 - alpha)^(T-1-t') times R_{t'}. You can check that this is indeed what is happening, or you can take this expression, plug it back into the recursion, the definition of the temporal difference update, and check that everything is consistent. So what is temporal difference doing here with this constant alpha? It is just taking an exponentially, or geometrically, weighted average, and the weighting favours recency: recent rewards have a larger weight. If you read the expression backwards, you see that your estimate at step three is alpha, your learning rate, times what you got one step before, plus alpha times (1 minus alpha) for what happened two steps before, plus alpha times (1 minus alpha) squared for what happened three steps before. This is what is expressed here: it is a sort of fading memory, a geometrically fading memory. If you have your sequence of steps from zero up to T, and you have collected rewards, which I can mark as points, reward zero, reward one, up to reward T-1, then what this object computes is a weighted average of those rewards, weighted by a geometrically, or exponentially if you wish, decreasing kernel (1 - alpha)^(T-1-t'). You can see how the two extremes behave. If alpha is exactly equal to one, the kernel is extremely short: it only cares about the current moment, and your estimate is just your instantaneous reward. It relaxes immediately to what you just observed, which is a good idea if there is no noise, but a very bad idea if there is noise. Also, in a multi-armed bandit the reward you receive depends on the action you just took, so this estimate would retain no memory whatsoever of all the other pulls, and alpha tending to one is not a good idea. A large alpha does have its merits, because you converge rapidly: this factor gives you the rate of convergence of the sum, and if alpha is big, (1 minus alpha) is small, so the sum converges quickly towards its average; but of course you then have a lot of variance, because the estimate follows the fluctuations of the latest rewards. So you may want to see what happens in the limit of alpha tending to zero: the difference is that you average a lot, because you weight all the past in more or less the same way, but you have to wait a long time before your estimate eventually approaches the true value. So, in a nutshell, what temporal difference with constant alpha is doing in this very simple setting is computing a running average of the rewards, with an exponentially fading memory of the past, and it keeps tracking it. As such, it stays around the mean; you can actually prove that the average value of this estimate is the true value function, which is a nice exercise: prove that the expectation of V_t is V^pi. If you average over many, many histories of actions and rewards, this is exactly V^pi. The key observation is that everything here is linear, so taking the average is very, very simple.
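In LaTeX form, the closed-form estimate and the recursion it satisfies, so you can check the inductive step mentioned above:

```latex
V_{T} \;=\; \alpha \sum_{t'=0}^{T-1} (1-\alpha)^{\,T-1-t'}\, R_{t'},
\qquad
V_{T+1} \;=\; V_{T} + \alpha\,(R_{T} - V_{T}) \;=\; (1-\alpha)\, V_{T} + \alpha\, R_{T}.
```

Plugging the sum into the right-hand side of the recursion reproduces the same sum with T replaced by T+1, which is the inductive check.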
But you can also see that in this case the variance of V_t does not go to zero; it stays finite. This is the drawback of having a constant alpha. You can check how it depends on alpha: the intuition is that as alpha goes to zero the variance should go down, but you will also realize how long it then takes to get anywhere. Very good. Now, the second situation, which is also of interest: let's take alpha_t equal to 1/(t+1). This choice sits right at the edge of the Robbins-Monro conditions, but it is inside: the series of the alpha_t diverges, while the series of their squares converges. Very good. So what does it mean to perform temporal difference learning with this choice? Let's write the formula once more, with alpha_t = 1/(t+1). Is this expression familiar to any of you? No? Doesn't ring a bell? Okay, then let's repeat the exercise we did a second ago. Let's start at t equal to zero, with V_0 equal to zero. Then V_1 equals V_0 plus 1/(0+1) times (R_0 minus V_0), and since I am replacing V_0 by zero in both places, this is just R_0. Agreed? Fine. Then let's go to t equal to one, which gives me my estimate V_2 equal to V_1 plus one half times (R_1 minus V_1). So this is R_0 plus one half of (R_1 minus R_0), which is (R_0 plus R_1) divided by two. Should I go on? Can you guess what this will be? Anyone want to try? With one sample I had R_0, with two samples I got their average, and with T samples... yes, the empirical mean. This object is nothing but the empirical mean: the sum of the rewards from time zero to time T-1, divided by T. In fact, this recursion is nothing but the recursive way of computing a sample mean. I'm pretty sure you have used this in the past: when you compute the mean of data that arrive one after the other, you don't wait to collect all the data and then sum and divide; you do it online, and as data arrive you correct your current estimate by this rule. Which is, again, another way of interpreting, at a very broad and admittedly superficial level, what temporal difference is doing: it is trying to compute an estimate, and in this case the estimate turns out to be exactly the sample mean. So you could do it with a memory that fades out very slowly in time, like 1/t, or with a memory that fades out geometrically in time, as in the previous case. If you do it with 1/(t+1) you are guaranteed to converge to the mean, simply by the law of large numbers, and the variance decays like one over the number of samples. That is another way to understand what is going on with these methods in this very simple setting; of course, generalizing from this to the full temporal difference problem is not obvious, which is why I presented the general framework before the example.
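As a quick numerical illustration of the two schedules just discussed, here is a small Python sketch on a made-up Gaussian reward stream (the mean 2.0, the variance, and the value of alpha are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = rng.normal(loc=2.0, scale=1.0, size=5_000)   # made-up reward stream, true mean 2.0

v_const, v_mean, alpha = 0.0, 0.0, 0.1
for t, r in enumerate(rewards):
    v_const += alpha * (r - v_const)     # constant alpha: exponentially fading memory
    v_mean  += (r - v_mean) / (t + 1)    # alpha_t = 1/(t+1): incremental sample mean

print(v_const)                  # hovers around 2.0 but keeps fluctuating (variance does not vanish)
print(v_mean, rewards.mean())   # equal up to rounding: the recursion is the sample mean computed online
```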
So you see that this is just one specific instance of the general problem. All right, very good. I hope we have now gone some way towards demystifying what the general idea of temporal difference learning is, while understanding that it is connected with very powerful ideas, like stochastic approximation, which in turn is connected with stochastic gradient descent. So whenever you encounter, in your various excursions in machine learning, the notion of stochastic gradient descent, remember that there is a continuous path bringing you from the simple notion of how to compute a sample mean to stochastic gradient descent, and these are the steps that connect the dots. Now we are really ready to start the real endeavour, which is how to couple this kind of reasoning with optimization, and we will do that for bandits. For that we just need to enlarge a little what we have been doing so far and take a slightly new angle on it, and then we are ready to go; but I suggest we start after the break, which could be, say, five minutes. See you later. So, the last ingredient we need in order to set up the stage in full for tackling the problem of finding the optimal policy, at least for bandits today, is to recall that so far we have mostly been working with an object which is the value function of a state. But there is another important quantity, which is the value function of a state-action pair, and it shouldn't come as too much of a surprise that you can learn the state-action value function as well, by the same technique. Just a quick reminder of what this object is. The quality, or state-action value function, of a pair of state and action is the expected cumulative return, now in the general notation, starting from a certain state and a certain action; there is a double conditioning here on the initial state and on the action that is taken. By unrolling this sum, extracting the first term and considering all the rest that happens afterwards, it is easy to derive the following relationship between the state-action value and the state value function. That is how the two things are connected, and it is quite straightforward: it tells you that if you start from a state s and pick action a, then with probability p you are sent to a new state s'; as a result of this state, action, and new state you collect an average reward r, and from that point on, what you gather is just the value function at s'. So you have split the contribution of the action taken at the initial time from everything else that happens in the future. Any questions? There is also a converse relationship, which is also straightforward: the value function at a certain state is just the average, over the actions chosen there, of the state-action value. So this is again a mapping between the two languages. These two equations together clearly tell you that the value function and the Q function are completely equivalent, and you can trade one for the other. The advantage of the value function is that it is more compact, just a vector; the advantage of the Q function becomes more apparent when one discusses things like the Bellman optimality equation.
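In formulas, the two relations just stated between Q^pi and V^pi are:

```latex
Q^{\pi}(s,a) \;=\; \sum_{s'} p(s' \mid s,a)\,\bigl[\, r(s,a,s') + \gamma\, V^{\pi}(s') \,\bigr],
\qquad
V^{\pi}(s) \;=\; \sum_{a} \pi(a \mid s)\, Q^{\pi}(s,a).
```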
Okay, I will make that clear in a second. Just to complete the picture and see the two things at work side by side, the first thing to realize is that if we combine these two equations and replace the value function by its explicit expression, we can write the following relationship: I just substitute the last expression for the value function here. This gives a closed linear equation for Q, which can be used for temporal difference learning, pretty much as we used the recursion relation for V. That is the first message, and we will do it in a second. But let me also stress that there is a Bellman optimality equation for Q as well, which reads: Q star, the Q function of the optimal policy, equals the sum over the next states s', where the first part stays the same as always, and now a maximum over the next action appears inside. This also tells me that the optimal policy at state s is to pick the action that maximizes the Q-star matrix at that state. So this is the equivalent of the Bellman optimality equation, and of how we derive an optimal policy from its solution. This is where the merit of using the state-action value function becomes apparent, because it tells you that if you know the Q-star matrix, it is enough to scan its columns. Let me draw Q star as a matrix in which the rows are labelled by states and the columns by actions. If you fix one particular row and select the column with the largest entry, that selects the best action. That is what the formal expression above for the optimal policy means: just by looking at this table you can read off, for each state, that is, for each row, what the best action is; you take the entry with the largest value of Q star. So the connection between optimal action and value function is much more direct for Q, at the price of dealing with a matrix rather than a vector. There are advantages and disadvantages; other than that, the two formalisms are exactly equivalent. There is no grey area in which one gives something inconsistent with the other. In particular, you can also connect this with the optimal state value: the largest entry in each row is exactly V star of that state, so the dictionary matches perfectly. We will come back to this later on. For the moment I just want to stress that, given the recursion relationship for Q, you can derive a temporal difference learning rule for state-action values: given a policy, how do I learn the Q matrix for that policy? It works pretty much as it did for the value function: we move the left-hand side to the right-hand side and turn the expectation into a stochastic update. I will not repeat all the steps, because we have done them many times, but the bottom line is that you can update your Q function at each step according to a rule which closely resembles the one for the value function, with two important changes.
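Spelled out, the closed relation for Q^pi, the Bellman optimality equation for Q, and the greedy policy and optimal value it induces read:

```latex
Q^{\pi}(s,a) \;=\; \sum_{s'} p(s' \mid s,a)\Bigl[\, r(s,a,s') + \gamma \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s',a') \,\Bigr],
```
```latex
Q^{*}(s,a) \;=\; \sum_{s'} p(s' \mid s,a)\Bigl[\, r(s,a,s') + \gamma \max_{a'} Q^{*}(s',a') \,\Bigr],
\qquad
\pi^{*}(s) \;=\; \arg\max_{a} Q^{*}(s,a),
\qquad
V^{*}(s) \;=\; \max_{a} Q^{*}(s,a).
```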
Okay, so the two important changes are these. First, your temporal difference error now has a different form, in the sense that it is constructed from your Q-matrix values, of course, because those are the quantities you update here; it is not the only choice, but it is the most straightforward one. Second, your eligibility traces now have to give credit not only to states but also to actions, so the eligibility is indexed by a state-action pair. (Sorry, you're missing a t in the top right. Yes, thank you.) So, from your stream of states, actions, and rewards you can construct this temporal difference error, noticing that you have to go one step further, to the action at time t+1, in order to construct it, and then you can use it to learn as before. Now let's go back to our example with bandits. With bandits we now have a Q function, Q^pi, which depends on the actions: if you have two arms this is a vector with two entries, if you have K arms it is a vector with K entries. So what does temporal difference for Q look like here? Let's write it explicitly. My new estimate for each arm is my previous one, plus my learning rate; here I can use gamma equal to zero, so I can drop the middle term, and what remains is just my reward (again, whether you call it R_t or R_{t+1} is just the notational mismatch mentioned earlier) minus the current estimate, and this is multiplied by the indicator that the arm equals the action I just picked. Here is the slight difference from what we have been doing so far: we now update only the entry of our vector corresponding to the action that has just been chosen; all the other entries we leave as they are, because this indicator function says: you got a reward, and the action to give credit to, or to blame, for this result is the action you just took. Makes sense? As a result, while you are working out your strategy, you have to keep several records, one for each entry of your vector, and you update them asynchronously: every time you pick an action you update that record, and all the rest stays the same, and so on and so forth. So it will not come as a surprise that if you do the same kind of calculation we did for the value function, but now in this situation, the formula turns out to be very similar, except that you no longer count time t, but the number of times you have taken that particular action. It is easier written than said. If you use, for instance, a constant alpha, then after T steps, starting as always from an initial guess of zero, your estimate for arm a is a sum over the times at which that arm was pulled, with your alpha here and your (1 minus alpha) there; but where before we had the final time minus the current time in the exponent, we now have to replace it by the number of visits to that arm up to the final time minus the number of visits up to the current time. That is the only change.
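A minimal Python sketch of the bandit version of this update (gamma = 0, constant alpha, only the pulled arm's entry touched). The `pull(arm)` function and the `policy(Q, rng)` callable are hypothetical stand-ins for the arm distributions and the behaviour policy; none of these names come from the lecture.

```python
import numpy as np

def evaluate_bandit_policy(pull, policy, n_arms, alpha=0.1, n_steps=10_000, seed=0):
    """Per-arm TD update for a K-armed bandit: only the chosen arm's record is updated.

    `pull(arm)` is assumed to return a sampled reward for that arm; `policy(Q, rng)` returns
    the arm to play given the current estimates. Both are illustrative interfaces.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros(n_arms)                 # one record per arm
    for _ in range(n_steps):
        a = policy(Q, rng)               # pick an arm according to the fixed policy
        r = pull(a)                      # reward drawn from that arm's distribution
        Q[a] += alpha * (r - Q[a])       # gamma = 0: the TD error is just R_t - Q_t(a)
    return Q
```

For instance, `policy = lambda Q, rng: rng.integers(len(Q))` would evaluate the uniformly random policy discussed earlier, given some `pull` that samples each arm's rewards.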
So basically, time flows according to the number of visits: every arm has its own clock, if you wish, which counts the number of times that action has been taken. I'm not going to work out the algebra; it is rather boring and not particularly informative, but the bottom line is this, and the intuition is always the same: you keep a separate record for each arm. Here n_t(a) is the number of visits to arm a up to time t, that is, the sum over i from 0 to t of the indicator that A_i equals a (please check me if what I'm writing is correct); it is a random variable counting the number of times you have taken that action. So the same things apply, only you do them in parallel on different records. You could also set this up slightly differently: that was the first situation, with a constant rate, but you could also use adaptive learning rates. Suppose you define your rate as depending on the arm, so that every arm learns at its own rate, equal to one over the number of times you have visited that arm so far, plus one. So for every arm you count the number of times you have visited it, and you learn at a rate which depends on that count. If you do that, and you do the simple exercise, you will see that your estimate Q_t(a) is the sample average of the rewards obtained from arm a. So there is a parallelism here: if you do things carefully, you can map these estimates onto sample averages as well. All right. Now we are more or less equipped with everything we need to proceed. Let's try to face the problem of coupling all this machinery with the search for the optimal policy for bandits. What is the general idea we have to deal with? The framework we have in mind is the following. Remember that if we have some policy, this policy produces actions, and from these actions we observe rewards, and possibly also new states, and from these we update an estimate of our Q. This is the flow that temporal difference learning proposes: you have a policy, it produces an action, you observe a reward, you update your Q, and then you start over again, always picking actions according to your policy pi. The idea we want to explore now is to close this loop. So what is the idea? At every time step you will have a policy in your hands; let's make it time dependent, so pi_t is the policy at a certain time. We use our current information about the value function, the estimate we have at that moment, in order to improve the policy; remember, this is what happens in the Bellman equation. In short, we use the updated estimate in order to produce a new policy: you have a Q function, and from that Q function you derive a policy. This is also very similar to what we did with policy iteration: we had a policy, computed its value, used this to improve the policy, then computed the new value, and so on.
Okay, so the idea is always this idea of cycling: solving the problem by using experience, cycling between evaluating and improving. What is the simplest idea? For instance, we might say that if we have a certain estimate Q_t, we pick our policy pi_{t+1} as the argmax of Q_t. Again, falling back to the bandit situation, my policy would be to pick the argmax over arms: I have one vector, which is my current estimate of the quality of each arm, and I pick the one with the largest Q. So, once more, think about this loop: I am using some policy and producing samples for each of the options I have, say there are two arms. I pull one and get a number, I pull the other and get a number, and so on and so forth, and I construct something that looks like the sample average for the two arms. Then at each step, based on which one looks better from the viewpoint of the sample average, I stick to that. This is clearly dangerous, and the goal of this part is precisely to show you how bad this choice is: a simple idea, but too naive. Where does the idea come from? It comes straight from the Bellman equation: that is what the Bellman equation tells you, if you have your Q star, then you obtain your policy just by picking the maximum. But the point is: can you trust your current approximation? Is it accurate enough? This is what you would do according to the so-called greedy choice. So how do we show that this is not a good option? Let's consider the situation with K equal to two, a two-armed bandit, and let's assume the rewards are positive. There is no specific need for that; it is just that what I will show graphically in the following requires positive rewards. Since there are just two arms, I can depict my Q vector on these axes: this is component Q1 and this is component Q2. And let's assume that the true values (remember, we are not Bayesian here; there really are true values for the bandit) are somewhere here: this is the average reward for arm one and this is the average reward for arm two. Whatever their distributions are, they might be two Gaussians, they might be anything, their means are there. Of course, we want to construct approximations of our Q that get better and better, but, most importantly, that lead us to always choose the best arm. Which is the best one in this case? The best choice is arm one, because everything that stays below the diagonal means that arm one looks better than arm two. So let's see how our greedy algorithm works in practice. It starts from somewhere: the first thing we have to do is pick some initial value. Just to clarify things, let's start with one particular choice, and we will try many of them. Say we start here. This is a choice that already looks very well informed: it is as if someone were whispering in your ear, saying, listen, arm one is much better than the other one. Fair enough, it is a possibility; maybe we have just been lucky and we start from a good guess.
We will explore many starting points; this is just the first. If you start with that choice, at the first step you are greedy, and what do you do? You play arm one. And if you play arm one, you update Q1, so you make a step in this direction, in expectation at least, because it depends on the variance of the rewards in the first pulls. Then you keep moving around, and eventually you expect to converge here. Notice that you never pick action two, never: there is no reason whatsoever to pick it. So you keep playing action one and converge to some point close to mu1. You will, of course, learn the value of arm one, which is your best arm. Good. As a matter of fact, in this case you have been playing optimally all of the time, because you have been playing the best arm since the beginning, so no problem whatsoever. Now let's see what happens if you start from another choice. For instance, let's do what we have been doing until now and start from zero. Since we start at zero, we are completely indifferent: both entries of my Q vector are the same, so I could choose one or the other. Say that by chance I choose arm two. If I choose arm two, well, mu2 is well above zero, so reasonably speaking this estimate will go up, and I find myself in the upper part of the graph. What do I choose now? I choose arm two, because I am in the upper part of the graph, and so I will go on and play arm two all the time. My estimate for Q1, which could have been much better than Q2, stays at zero just because I never visit that arm. So I end up here: I learn perfectly well what is happening with arm two, but the fact is that I am not collecting good rewards; I am losing money, collecting information about the wrong thing. We could start from some other place. For instance, suppose we start up here, or rather, let me put it in a better spot, suppose we start up here. Now the situation also seems somewhat dangerous, because we are in the upper-left part. But when we start playing here, the nice thing that happens is that my estimate of arm two goes down as I play it, and then I cross this boundary; and once I cross the boundary, I start playing arm one, which lets me converge here, and from then on I stay in the good loop. So you see that the bias of the value from which you start becomes very important. All these sketches are admittedly very rough; the real situation, when the dynamics are richer than this, is more complicated of course, but this gives you the idea: if you are greedy, the initial bias can kill you. In fact, there is a full picture of this, which you can verify independently: there is a whole region of starting points, depicted here in dark orange and including the axis, such that if you start anywhere inside it, you stay stuck and you will never be able to reach the optimal decision, which is to choose arm one. All the other points manage, one way or another; in doing so, they give up getting information about arm two at some point, so all the other points converge to the axis here, somewhere along this axis, at different levels depending on where they started from.
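To see the initialization bias concretely, here is a small simulation sketch with made-up numbers: two Gaussian arms with means 2 and 1. A greedy start that already favours arm one converges happily, while a start that over-rates arm two typically stays stuck on it forever, because the estimate of arm one is never updated. All parameter values are illustrative, not taken from the lecture.

```python
import numpy as np

def run_greedy(q_init, mus=(2.0, 1.0), sigma=1.0, alpha=0.1, n_steps=2_000, seed=1):
    """Pure greedy on a two-armed Gaussian bandit; returns final estimates and pull counts."""
    rng = np.random.default_rng(seed)
    Q = np.array(q_init, dtype=float)
    counts = np.zeros(2, dtype=int)
    for _ in range(n_steps):
        a = int(np.argmax(Q))              # greedy choice (ties broken toward the first arm)
        r = rng.normal(mus[a], sigma)      # sample a reward from the chosen arm
        Q[a] += alpha * (r - Q[a])
        counts[a] += 1
    return Q, counts

print(run_greedy([5.0, 0.0]))   # favourable start: plays arm one throughout, Q[0] approaches 2
print(run_greedy([0.0, 3.0]))   # biased start: typically stuck on arm two, Q[0] frozen at 0
```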
None of those necessarily gets the right value for arm two, but that doesn't really matter, because they end up playing the best option. But of course this is not what we want in general: this is an extremely bad decision-making process, because we want to make the right decision with high probability, and eventually we would like to make it all the time. So is this possible? Many questions arise. Can we devise an algorithm that does that in finite time, that is, can I construct an algorithm which guarantees that after a certain number of steps I will be playing the best action with probability one? Spoiler: no. Is it possible to have something slightly weaker, something that approaches optimal decision making asymptotically, so that the probability of picking the wrong arm goes to zero with time? Yes. And how fast can I approach my optimal behaviour? All these questions are answered mathematically by the theory of multi-armed bandits; there are entire books about it, so we have no hope whatsoever of covering a large part of it. What I want to give you today is, first, a first fix for this problem, which is very general and applies also to the more complex setting of reinforcement learning, including states; and second, an idea of what the limits are, what is possible and what is not, and which limits are imposed essentially by statistics on the performance of decision making. These are the two things I would like to give you in the next fifteen minutes. Okay, so, a better idea. What was the problem here? Let's discuss it very informally: the problem is that we were working with the wrong mindset. Think about a true decision-making problem. There is a teacher, and this teacher has to decide whether a student is a good student or a bad one. The teacher gives some assignment, receives it back, and has to decide whether to fail the student or not. This is clearly a one-shot problem in which the teacher is in a very bad position, because there might be various reasons for bad performance: that particular day the student might feel unwell, might have a problem, or it might just have been the one question he hadn't looked at carefully. There are many sources of error. A better strategy is to repeat the test: don't give just one assignment, give ten, and after that make a more careful evaluation. But it is also true that you don't want to give a thousand assignments before deciding. So as time goes by, you form your own opinion about the student. This very simple idea means that one has to balance the need for exploitation, that is: based on my current knowledge, how far can I commit to a decision, and the quicker I can, the better; against the need not to kill exploration, that is, to allow for the possibility that, since there is randomness in the world, things may have happened just out of bad luck and not because of a causal relationship. In a world that is perfectly deterministic, once you see an outcome, that's it: if you repeat the experiment it will give you the same outcome again. But if there is randomness, that is not the case, and we have to confront uncertainty.
So good news for you students, because I will try to explore as much as needed. But how to do it? A better idea is to mix some exploration into the previous idea. Mix in, not a lot, but just some, because there is no need for more. How to do that? One simple recipe, for instance: let's define a small parameter epsilon between zero and one, which we will call our exploration rate, and let's say that our policy at every time is the argmax of my estimate Q_t with probability 1 minus epsilon, so almost always, where "almost" depends on how small epsilon is, I pick what my current knowledge suggests; but from time to time, with probability epsilon, I pick an action uniformly at random. If I repeat my experiment, starting in a situation where we were stuck, for instance, and I'll draw this in red: we were supposedly stuck here, but now there is this small epsilon. Suppose epsilon is one over a thousand. Then, roughly speaking, for the first thousand or so steps I keep doing what I always do, playing only arm two, and I get stuck here; but occasionally I play action one, and therefore I very slowly crawl in this direction until I cross this magic point, and then I start playing action one, and things finally accelerate and get here. So this little noise that I am injecting into the search is enough to let me cross that point. So are we done? Not quite. What is the problem now? On the one hand, good: we are no longer stuck. The bad news is that we keep on exploring: we still do the bad thing once every thousand steps, and cumulatively, over the very long run, this is going to cost us. So again, there is another tension here: this is still too much exploration. What we would really like is to explore more when we don't know things, and less when we know a lot. One way out would be, for instance, to explore more at the beginning and less at the end, or to explore an action more when we have sampled it poorly and less when we have sampled it a lot. We have to find a way to balance these two things. So, an even better idea: let's schedule the exploration rate, that is, decide how to change the exploration with time; a sketch of the recipe, with and without a schedule, follows below. For instance, at the very first step we can set epsilon equal to one. What does that give us? If epsilon is one, we just take actions at random, no matter what our Q says; it is a way of ignoring the initial bias, if you wish, and moving around randomly. So at the first steps, if epsilon is very large, there is no such thing as staying in place: you just start moving around, and if the learning rate is also large, you make large steps. If you combine large learning rates and large exploration, your point here basically moves around randomly: you explore a lot. But as time goes by, you would like, first, to make smaller steps, which means alpha going to zero, and second, to explore less, because you will be focusing more and more once you are close to the target and you don't want to wiggle around. Fine. We know how to schedule the learning rate; but how do we schedule the exploration rate? What is a good choice for decreasing epsilon in time so as to end up with the best choice? Here things become extremely interesting.
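Here is a sketch of the epsilon-greedy recipe on the same hypothetical two-armed Gaussian setup, with an optional schedule for epsilon. Both the fixed value 0.1 and the 1/(t+1) decay are illustrative choices for the example, not prescriptions from the lecture.

```python
import numpy as np

def run_eps_greedy(mus=(2.0, 1.0), sigma=1.0, alpha=0.1, n_steps=20_000,
                   eps_schedule=lambda t: 0.1, seed=2):
    """Epsilon-greedy on a two-armed Gaussian bandit with a user-supplied epsilon schedule."""
    rng = np.random.default_rng(seed)
    Q = np.zeros(2)
    for t in range(n_steps):
        if rng.random() < eps_schedule(t):
            a = int(rng.integers(2))       # explore: pick an arm uniformly at random
        else:
            a = int(np.argmax(Q))          # exploit: pick the currently best-looking arm
        r = rng.normal(mus[a], sigma)
        Q[a] += alpha * (r - Q[a])
    return Q

print(run_eps_greedy())                                       # fixed epsilon = 0.1
print(run_eps_greedy(eps_schedule=lambda t: 1.0 / (t + 1)))   # decaying exploration rate
```

Note that the decaying schedule starts at epsilon equal to one, which is exactly the "explore fully at the first step" idea mentioned above.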
And there would be a lot of maths to do in order to prove the statements I'm about to give you, so we are not going to do that; I will just summarize the results and give you some insight, a qualitative, semi-quantitative understanding of what is going on. The key idea is the following. Suppose that in our two-armed bandit we have a true value mu1 and a true value mu2, and their difference is called the gap. After a certain number of trials, suppose I have played arm one many times; then I have a distribution of values for my possible estimate Q1, in the sense that if I repeated the experiment many, many times under a certain choice of policy, say under epsilon-greedy, I would get a spread of estimates. Sorry, I forgot to say that the choice above is called an epsilon-greedy policy, because it is greedy, but with exceptions occurring at rate epsilon. So, applying my epsilon-greedy policy, I play arm one many times, and, because my algorithm is working properly, I play the second arm much less than the first, because it looks suboptimal; which means that its estimate will have a distribution that is much broader, because the variance of the estimate goes like the variance of the rewards divided by the number of times I have played that arm. This sigma squared is the variance of the rewards, and this n is the number of visits to that arm: many visits, little uncertainty; few visits, larger uncertainty. So the dangerous events are the ones in this tail of the distribution for Q2: those realizations in which, after n1 visits to arm one and n2 visits to arm two (these are the numbers of times you have chosen them), the estimate Q2 actually comes out larger than Q1, because that is the situation where the greedy strategy would make the wrong choice, and that is the probability you want to kill. Well, for Gaussian distributions this probability goes like the exponential of minus the number of times I played arm two, times the gap squared, divided by two sigma squared. So it goes down exponentially with the number of visits to that arm. So, to kill this probability you want n2 to be large, but to win you want n2 to be small; there must be a sweet spot in between where you balance these two things. Where is this sweet spot? It is where you keep this green tail of the distribution under control, but not too much: you don't want to make the distribution of Q2 very, very narrow, because you don't care. This is one very important conceptual point to bear in mind in decision making: information is not the goal; the goal is to gain rewards. Let me give a very simple example. Suppose I tell you that under option A you win zero or you win one, with certain probabilities, and under option B you gain zero or something that could be anywhere between zero and minus one million. What do you choose? Option A, right? But option A is characterized by only one bit of information.
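In symbols, with gap Delta = mu1 - mu2 and Gaussian rewards of variance sigma squared, and assuming arm one has been played enough that its estimate sits close to mu1, the statement above is roughly:

```latex
\operatorname{Var}\bigl[\hat q_a\bigr] \;\approx\; \frac{\sigma^2}{n_a},
\qquad
\Pr\bigl[\hat q_2 > \hat q_1\bigr] \;\lesssim\; \exp\!\Bigl(-\,\frac{n_2\,\Delta^2}{2\sigma^2}\Bigr).
```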
Whereas option B carries an enormous amount of information: if the variables are continuous, there is strictly speaking infinite information there. But you don't care, because this information has no value for you. Information is necessary, but only the bits of information that matter. Notice that information theory itself does not contain this notion of how much a bit is worth: every bit is a bit. So decision making and information theory share a boundary, but one has to be very careful about where it lies: maximizing information in decision making might be a very bad choice. You don't have to be too curious about bad things, once they are established as bad with reasonable statistical confidence. That is where all the subtlety is: you need to collect enough information to be statistically confident that an option is bad, and once it is, you don't need to be curious about it beyond that. That is where the sweet spot between exploration and exploitation lies. I'm talking like a new-age guru, so I'd better stop here and go back to the math. The math tells you that the sweet spot is located where n2 grows logarithmically with time. This requires some elaborate math, but it is clear that n2 must grow: the number of times you visit a suboptimal arm must keep growing in time, because if you stop visiting it, there is still a small probability, even an exponentially small one, that things are different from what you have observed. So n2 must grow, but it must not grow too fast, and it turns out that the sweet spot is to make it grow logarithmically in time. The precise statement of this is the Lai-Robbins bound, from, I think, 1985, yes, I have it here, 1985, so comparatively recent with respect to the age of the multi-armed bandit problem. The Lai-Robbins bound says the following. Take the expected number of times that you play an arm a, where a belongs to the suboptimal choices: you have a K-armed bandit, you pick an arm a which is suboptimal, and n_a(T) is the number of times you have played that arm up to time T. Divide its expectation by the logarithm of T, and take the limit of T going to infinity. This quantity must be at least 1 over the Kullback-Leibler divergence between the reward distribution of the suboptimal arm and that of the optimal arm, where these are the probability densities of rewards for arm a and for the best arm. So there is a lot to unpack here. This is an asymptotic statement, and the condition must hold for any policy such that pi_t asymptotically picks the argmax of the true values, that is, for any policy that asymptotically chooses the best arm with probability one, any good policy, so to speak. Even for such policies, you have to pick the suboptimal arms at least logarithmically often in time, and the prefactor of the logarithm is connected to an information-theoretic object, the Kullback-Leibler divergence between the two distributions. For Gaussians, the Kullback-Leibler divergence becomes the gap squared divided by twice the variance, so you can connect this with the kind of argument I gave above. Deriving this formula is not very difficult, but it would take two hours by itself.
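Written out, the Lai-Robbins lower bound, for every suboptimal arm a and every policy that asymptotically plays the best arm with probability one, reads:

```latex
\liminf_{T \to \infty}\; \frac{\mathbb{E}\bigl[n_a(T)\bigr]}{\log T}
\;\geq\; \frac{1}{\mathrm{KL}\!\bigl(p_a \,\|\, p_{a^{*}}\bigr)},
```

and for Gaussian arms with common variance sigma squared the denominator becomes Delta_a squared over two sigma squared, matching the tail argument above.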
I will point you to the book that works out and elaborates on these arguments, where you can find all the mathematical results if you're interested. But the basic message is that this is the quintessential bound when it comes to balancing exploration and exploitation: it tells you that you have to keep on visiting apparently bad options. You can never stop exploring; you should go on and on, but at the smallest acceptable rate, which means that, cumulatively, the number of times you visit a bad option must be logarithmic. What does this mean for epsilon-greedy? It means epsilon_t should go like 1/t: if you pick the bad action at each time step with probability of order 1/t, then when you sum over all steps, the expected number of times you do the wrong thing is logarithmic, because the integral of 1/t is the logarithm of t. And you have to be careful about exactly what you put in the schedule: if you use something like 1/t, the prefactor matters; if it is large enough, it is fine, but if it is too small, the scheme may fail. All right, I have been overflowing. So, a very last message: you can do even better than this. You could do something adaptive: your exploration rate could depend on the action, like one over the number of visits to it, which combines the idea of adaptive learning rates with adaptive exploration based on visit counts; you can adapt the exploration as well as the learning rate. And then there is a whole world of algorithms that we don't discuss. Sorry, that is not the correct title on the slide. The go-to reference for this is Bandit Algorithms by Lattimore and Szepesvári, which opens up the discussion to essentially all current knowledge, as of a year or two ago, about bandit algorithms from the algorithmic viewpoint. I will post a link to this book for you. All right, we're done for today. Tomorrow we will see how to combine this idea of epsilon-greedy, and other kinds of exploration, with temporal difference learning in order to produce good algorithms that find optimal solutions of the Bellman equation without knowing the model. Fine. If there are any questions... Let me stop sharing. See you tomorrow. Bye-bye.