to help me with the organization. So can you please raise your hand if you are willing to help? Okay, so for those who want to help with the organization, we need half a dozen of them. Thank you very much. Welcome back.

So the purpose of today's lecture is to give you the essentials about what you can do as an agent when you don't have any knowledge about the laws of nature that govern the evolution of your environment. You have to rely only on experience, and perhaps some additional information about the structure of your environment, but not the detailed laws.

So let's go back again to the diagram we drew a few lectures ago. On these two axes we put our measurement of the environment, so the quality, the precision of the monitoring, and on this axis we put our knowledge of the environmental model. We first started up here with Markov decision processes, and then we moved here with partially observable Markov decision processes. So in this case, again, on the far right, you have knowledge of the model of the transitions that the environment makes when subjected to a certain action from the agent; you know the model of the observations, et cetera. And we've seen that this is just an extension of this problem to partial observability. If observations are not perfect, in that you do not observe directly the state of the environment but only some corrupted information thereof, then you fall into this problem. And we've shown that there is a big price to pay, which is to enlarge the space over which you operate: from the space of states of the environment, for the Markov decision process, to states and memory. In particular, if the memory is perfect, you can compound this notion of states and memory into the notion of a belief, that is, a probability distribution which tells you what your current belief is about the state of the environment that you cannot observe. We've seen that the evolution of these beliefs is controlled by Bayesian inference, and coupling these two notions we finally came to the definition of a deterministic decision process based on beliefs. So we've been talking about the value function of a belief and the associated Bellman equation.

Now we want to move here. In our diagram, the situation for today will be this. This is the agent, this is the environment. The agent receives from the environment the percepts, which now are made of a couple: a reward and the state of the environment. So the state is given, but the laws under which the environment evolves are unknown. Also unknown, well, even the shape of the rewards could be unknown. You just receive this data; these are just data that you get, and you don't rely on any specific knowledge about how they are shaped. You might know in advance how many states there are; that's some information that you are allowed to have, how many states your environment has, but other than that you don't know much. How many actions you can take, all these things are available to the agent.
Well, the actions, of course, because the agent makes them; but about the environment, you might know how many states there are in some cases, and we will be relaxing even that hypothesis in the end. But for the beginning, let's start with that. And then, again, actions are delivered. So the goal here, which is the goal of, let's say, the basic framework of reinforcement learning, is that you have to learn from experience. While doing things, you have to learn which are the best actions to take. How do these algorithms work, and what are the key ideas at their basis? That's going to be the content of the lecture today.

So there are several ways of dealing with this part of the phase space, several architectures that turn learning into an optimization. We will be focusing on just one large area of those, which turns out to contain some that are currently very much employed in the literature, for instance in AlphaGo. These are the so-called critic-only architectures; this name will pop up in a second, and they have a specific construction. There are other, different ones that we will hopefully discuss next Thursday, when we will be expanding on the knowledge that we have built. In particular, the algorithm that I will be presenting today, and that we will actually construct together, is the algorithm at the base of the demonstration that you will have tomorrow, where deep Q-learning will be used to solve the optimization problem for a cart-pole environment. And it is essentially the same one that is used for AlphaGo.

So what's the idea of these critic-only architectures? The idea is that they try to export ideas that we developed for Markov decision processes into the realm of experience. Can we take notions such as the value function or the quality function and turn them into something useful, even in the case where we don't know the transition probabilities of the environment? What used to be a simple computation will have to become something different, something that we learn how to do along the way. And that's the key idea: transplanting ideas from Markov decision processes to the case of experience.

So let's go back and review one of the results, which will be our starting point for today. You might remember that we derived, through the use of a Lagrange function, optimality equations for certain quantities: for instance the value function, which is a function of each state, and the quality function, or state-action value. You might remember that this quantity is a real number, and its numerical value is the optimal gain that you can get when starting from state s, picking an action a, and then following the optimal policy. The optimal policy is related to this quality function by a simple Boltzmann-like expression. This is all true for any finite epsilon. So this was one of the equations that we derived from the stationarity conditions. The second equation was a closed equation for Q, which itself goes under the name of the Bellman equation, and I'm just writing it for you again; it takes the following form. In our previous lecture we focused mostly on the value function. And yes, this pi of s given a, sorry, pi of a given s, of course, thank you very much, is the probability of picking an action in a given state, sure.
It's obviously normalized over the actions at each state. Thank you. So this is an equation which is itself useful if one wants to do the so-called value iteration technique; we've discussed this before. If one has knowledge of this quantity and this quantity, so if the model of the environment is known and the function of the rewards is known, then one can solve this nonlinear equation iteratively, by basically plugging a guess into the right-hand side for your Q function. It will return a new approximation, which you then plug in again, and so on back and forth. This iterative procedure is guaranteed to converge to the fixed point.

And just as we computed the Jacobian of the right-hand side operator for the value function, we can do the same thing in this case as well. Just to remind you, this equation can be written in abstract form as some nonlinear operator which acts on Q, and the condition for convergence is that the Jacobian of this nonlinear operator at any given Q is contracting; this is sufficient to ensure convergence. So what one has to do is to compute the Jacobian of the right-hand side, which is a function with entries (s, a) and is, of course, a function of Q, with respect to Q(s', a'). This is pretty much the same calculation that we did for the Jacobian in value space; now we're doing it in action-value space. And if you do that, it's pretty simple. You just have to recognize what happens when you take the variation of this softmax function, regularized with epsilon, of your Q(s', a'). It's a softmax over the actions a', and there is an epsilon here; remember that it's soft with parameter epsilon, which is just the definition of the thing you find below. If you take the derivative of that, it gives you the policy: it's a simple calculation, you differentiate the log-sum-exp and out falls exactly pi. So it's easy to show that this object is just gamma p(s' | s, a) pi(a' | s'), which is nothing but gamma times the transition matrix that sends you from one state-action pair to a new state-action pair. And just as we said for the value function, this is a transition matrix, so it's a stochastic matrix whose spectral radius is at most one. So overall the Jacobian has spectral radius at most gamma, and gamma is less than one; therefore this operation of Q-iteration is contracting, is well defined, and sends you to the fixed point. Okay. Good.

All of this, as it stands, seems to be useless for tackling the problem when we don't know what this thing is: we don't know this and we don't know this. So how can we put it to use? Actually, that's something that can be done, and it is conceptually very simple. Yes, that's what I said: this is a transition matrix, so its spectral radius will be at most one; actually it will be exactly one, since the matrix has a stationary state. But gamma is less than one, so overall this thing will have radius less than one. Okay. Sorry, say that again? No, the agent can just sample the environment: it sends an action to the environment, and the environment returns a reward and a state, and then again a reward and a state. That's all the agent has access to. It knows the rewards it receives.
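For reference, here is a reconstruction of the blackboard equations in standard notation; the exact symbols are an assumption, following common conventions for the entropy-regularized (soft) setting, but the structure matches what was said:

```latex
% Boltzmann-like optimal policy (for finite epsilon):
\pi^*(a \mid s) = \frac{e^{Q(s,a)/\epsilon}}{\sum_{a'} e^{Q(s,a')/\epsilon}}

% Closed (Bellman) equation for Q, with the softmax written as a log-sum-exp:
Q(s,a) = \sum_{s'} p(s' \mid s,a)\Big[\, r(s,a,s') \;+\; \gamma\,\epsilon \log \sum_{a'} e^{Q(s',a')/\epsilon} \Big]

% Jacobian of the right-hand side operator:
\frac{\partial (\mathcal{T} Q)(s,a)}{\partial Q(s',a')} = \gamma \, p(s' \mid s,a)\, \pi(a' \mid s')
% = gamma times a stochastic matrix on state-action pairs,
% hence spectral radius at most gamma < 1: the iteration contracts.
```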
Every time, along an evolution, this quantity is a random quantity. Before, you assumed that, given s, a and s', you knew what the value was; you knew this function. Now you don't know this function. You just get numbers out of the environment, values of this function, without knowing how it is shaped. For instance, unless you visit a state, you will never know what reward you get from that state; you have to experience it to know it. And on top of that, you don't even know the averages: rewards might be random. For the same thing, if you are in a given state and you take a certain action, you might sometimes get plus one and sometimes minus one; this function is just the average of these things that happen. In the reinforcement learning case you don't know what this average is; you just get samples, random samples, even for rewards.

All right, so how can we put this knowledge to use? It is extremely simple. We have to transform this equation into something which is based on samples rather than on knowledge of average quantities, like the average transition to s' given a, or the average reward. We want to turn this into something which relies only on experience. So let me rewrite this expression in another, absolutely equivalent form; the only thing I'm doing is pulling this term to the other side of the equation. Then what I'm saying is that this equation is absolutely equivalent to the following one. The expected value of the reward that you get, R_{t+1}, given that I was in state s_t, took action a_t, and ended up in state s_{t+1}, is by definition r(s_t, a_t, s_{t+1}). So R_{t+1} is a truly random quantity, which depends on the history of the path that you're taking across your states, and even on some inherent randomness in the reward distribution. And then there is this thing, plus gamma times the softmax with parameter epsilon of Q at s_{t+1}.

Before proceeding further, I would like us all to agree that this thing I wrote here is exactly equivalent to this one. What am I doing? I'm picking a trajectory in my state-action space, the result of my decision-making process: I do things, I end up in state s_t, I pick an action a_t. Then I will end up in a new state s_{t+1} with some reward, and I average over all possible s_{t+1}; that is an expectation value over the s_{t+1} that are picked from a probability distribution starting from (s_t, a_t). So I'm just writing that equation in the form of sampling. There will be some reward, which might even be zero, whose average is this function we've been using. But from time to time your reward might be different; say this reward is distributed as a Gaussian. Every time you are in a state, pick an action, and end up in another state, you get a reward which is not always the same: it fluctuates, but on average it is this value. And this average is the only thing you need in this equation. Is that clear? This is just for the sake of generality; the reward could well be deterministic, so that every time you pick a triplet (s, a, s') you always get the same reward. I just want to show you that you can also encompass the case of random rewards in the same description, to allow for more generality. So, any objection to this correspondence between the two equations?
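In symbols, the stochastic form of the Bellman equation discussed here reads (a reconstruction, with the same assumed notation as above):

```latex
\mathbb{E}\Big[\; R_{t+1} \;+\; \gamma\,\epsilon \log \sum_{a'} e^{Q(s_{t+1},a')/\epsilon} \;-\; Q(s_t,a_t) \;\Big|\; s_t, a_t \Big] \;=\; 0
```

where the expectation runs over s_{t+1} drawn from p(· | s_t, a_t) and over the randomness of the reward: at the solution Q, the bracket is a random variable with zero mean.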
In a sense, if you want, the correspondence between this equation and this one is just that this one is what you would get if you did a Monte Carlo simulation. You have a trajectory; when you sit on a certain state-action pair, you generate a jump and you evaluate this thing. You repeat it many, many times, keeping a record of all these things that happen, you take the empirical average, and then this quantity should be exactly that term. This is very simple but very crucial, because we've replaced the knowledge of this P with some sampling. At this point, the agent who wants to measure this does not need to know what P is. It's just a black box which sends out new states and rewards, no matter what the knowledge of the agent is. Yes, a few issues. I don't know if that adds much to what you're saying, but its average is always zero.

So in this setting, we want to find Q; that's the difference. We wanted to solve for Q here; now we would like to find Q out of experience. At the moment it's still unknown. I'm writing down an equation: it was an equation before, and now it's becoming something which is true only on average. So I'm replacing a deterministic equation with some sort of stochastic equation. In fact, that's exactly what we're going to do. We're moving from a problem where we want to find the root of a function, because we can look at this problem that way: the original problem was to find the root of a multi-dimensional set of functions. For each (s, a) there is a value; it's a function which lives in state-action space, and we want to find the zeros of all the separate coordinates of this function. You see the analogy? And now what we're going to ask is: is it possible to find the roots of this equation, the solutions of this equation, without having access to the function itself, but only to random samples of it? The thing inside this expectation turns out to be a random function of Q: if you fix a Q, it gives you a random value as outcome, whose average is exactly the function whose root you're looking for. It's very important that you grasp this idea before we move on.

But how to do it? When you knew all these things, this was just a way of doing a calculation; it might be simple, it might be complicated, there are alternative algorithms, okay, that's fine. Now, when you know neither of these, what can you do? Let's rewrite this deterministic equation as a sort of stochastic equation, in which the corresponding statement is just this. The idea is: can we find the solution for Q knowing only that this thing is not true at every time? The quantity inside the average fluctuates; it's not zero, it's zero only on average. Can we solve this kind of equation, in which we are looking for the zeros of a function without direct access to that function? Another way of saying it: suppose you have some function, and this will be the example we'll discuss in a second. Suppose you have some function like this, okay, and you want to find where its zero lies. There is some function, it has a zero here. One way of doing this is essentially the iterative formula I was discussing: you start with a guess, you compute f, et cetera, and then you converge. That's good as long as you know the function. And now I'm telling you: you don't know this function.
You just know that if you are at a certain point x, you will get some value which is this function plus some error with zero average. So you only have access to a corrupted version, which is nonetheless correct on average. If you sample your function many times at this particular x here, you will get many samples, a cloud of points, perhaps distributed like this, but their average will be the true function. The point is that at every time step of your evolution, you will be getting one of these samples. The error might be very large; it might even go down to very negative values. Observations of s are perfect here; it's the function that is noisy. It's the observation of the equation which is noisy, not of the states, here and here. This s_{t+1} is a random variable which changes: every time you are in s and a, you do something, and the environment returns a new s', which is whatever, depending on P. And then you just see: okay, this time I arrived in state s'; another time it might be another arrival state, et cetera. So there's randomness both here and, optionally, also here.

Do you see the analogy between this problem and this problem? We do not have access to this function; we just have access to measurements which are distributed around this mean. And by sampling this thing one at a time, we want to find the root. If we succeed in doing so, we will have an algorithm which is based only on experience. Pretty much as in the iteration method, you will need a starting point, a starting guess for your Q function, and then, as experience goes on and on, you will update it. You might think of this axis as the axis where the Q values live: you start off with a Q value which doesn't obey your optimality equation, then you pick a sample of this function and you use it to improve.

So we will proceed this way: I will give you the algorithm first, we will discuss why it seems to make sense, and then we will prove that it works — not in the full setup, but in a simple setup like this one, in order to make the calculation simpler, and you will get the spirit of why it's working. No, there's only one solution: you remember this operator is contracting, so there's just one solution. Okay, I promise by the end of the lecture we will show that we can get there with probability one by being sufficiently smart. Okay, fine.

So, the algorithm; and then we go back to this picture. It combines the idea of value iteration with ideas from Monte Carlo, and it goes as follows. You start with a trial value for your Q function, a guess, pretty much like you did when you knew the model. And then you update it as follows. This will be the previous value at the same state-action pair. Then there is one parameter, alpha, which might depend on time; this quantity is called the learning rate, or step size. Then, times what? Times this quantity, which I'm rewriting here again: R_{t+1}, plus gamma, let me write it as a softmax with parameter epsilon over a' of Q_t(s_{t+1}, a'), minus Q_t(s_t, a_t), the current estimate. So this is the update rule. And here there are two delta functions. What does that mean? This is a matrix, right? A vector in state times action space, or a matrix if you wish.
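Written out, the update rule on the blackboard is (a reconstruction in the notation assumed above):

```latex
\hat{Q}_{t+1}(s,a) \;=\; \hat{Q}_t(s,a) \;+\; \alpha_t\, \delta_{s,s_t}\,\delta_{a,a_t}\,
\underbrace{\Big[\, R_{t+1} \;+\; \gamma\,\epsilon \log \sum_{a'} e^{\hat{Q}_t(s_{t+1},a')/\epsilon} \;-\; \hat{Q}_t(s_t,a_t) \Big]}_{\text{temporal difference error}}
```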
So this matrix will be updated, corrected with respect to the previous guess at time t, only at the location of the current state-action pair which has been experienced. Your process goes forward and decides what to do, this produces this quantity, and then in the entry (s, a) which corresponds to the state-action pair that you just visited, you modify that entry. That's how it works in practice; that's how you would write it down on the computer.

Now let's have a look at this quantity between parentheses. Yeah, sure, question? This is Q of t, sorry, Q_t at s_{t+1}; sorry about that, thank you. Can you read it? I hope so. Does that matter? It's outside, but it could just as well be inside, because it doesn't depend on a'; it just makes for a nicer parenthesis here.

Let's spend some time on this. This quantity inside the bracket is so important that it has a specific name: it's called the temporal difference error. Why is it an error? Because it's how far you are from your desired value, which is zero. You would like that quantity to be zero, and it's not at every sample; it is so only on average. Every time you sample it there is an error, which is the distance of that particular observation from zero. But intuitively, we can also split it into two parts. This part is the actual reward that you get: while evolving your system, you start from state s_t, pick action a_t, get sent into state s_{t+1}, and receive this reward. This is what you actually get. This part here, including the gamma, is the predicted gain from the next step onward, minus this one, the predicted gain from the current state-action pair. Why is that? Because Q-hat_t is your current estimate of what the real optimal value is for that state-action pair: starting from s_t, you would get this in the long run. But one step later you will get the next value, discounted by gamma. The difference between the optimal gain starting from this point in state-action space and the one starting from the following point is exactly the average reward collected in between. So this is your estimate of how much you expect to gain, given your guess for the Q function. Is that clear? And this is another way of understanding why it's called an error: it's the difference between the actual reward that you get and what you would have expected.

This also gives an intuitive idea of why the algorithm should converge. What is happening? Suppose your estimate of Q is very negative: you're pessimistic, so your predicted reward is low. And then, nice surprise, you get a reward which is larger than what you expected. How do you react to this observation? You increase your expectation for that state-action pair. You thought at the beginning that, given that state and that action, the outcome would be very gloomy, but, surprise, you get something very good. Then at the next round you don't forget this; you say: okay, next time I pass by state s and action a, I will expect something better. So it's a predictor-corrector algorithm: it predicts, it measures, it corrects. This is what happens in one step; then the process goes on, and what you do is pick a new action from the new state s_{t+1}. And how do you do it?
You look at the same formula that we wrote for the Markov decision process, and you say that my new policy at time t+1, to pick an action a' given the new state s', will be just the exponential of Q_{t+1}(s', a') over epsilon, normalized as usual. I'm afraid you cannot read this, so I will just rewrite it on the other blackboard. You're welcome.

The combination of these two formulas, this update with that particular policy, is what we will call soft Q-learning — soft because there's always an epsilon in the game which softens the maximum. Hard Q-learning would be when you take this epsilon to zero, but then much care must be taken in that process. Please? Yes, you can rewrite the same equation for the next policy by replacing Q_{t+1} and shifting time one step forward; I was just lazy, okay.

So here there are two issues when one wants to prove actual results about this thing. One is about these alpha values, the learning rates. Let's start with simple things. Suppose this object does not depend on time: it's some fixed learning rate, a number. It tells you how prudent you have to be in revising your estimate according to new knowledge that comes in. If you are reckless, you would push alpha up to one. If you set it to one, you see that this term cancels with this one, and you jump to the conclusion that what you just sampled is actually what you will always get, right? So one is evidently too large, and it's the largest value actually admissible here; formally you could even put in larger values, but okay. When alpha becomes very, very small, that's better on one hand, because you are very prudent: you accumulate experience, you sum, and you converge. But of course it will take a long time before you get there, so this is not desirable either. In order to achieve fast learning with good performance, this alpha must hit some sweet spot. We will discuss how to choose this sweet spot, or how to play with it, in the simpler situation of stochastic root finding for a one-dimensional function, which will come in a second.

Another way of writing this, if you don't feel at ease with the deltas: this is the update if s equals s_t and a equals a_t; otherwise, Q at time t+1 at (s, a) is equal to the previous one. You do not update the value of entries you have not visited; only the last state-action pair you visited gets the update. Basically this is the issue of what is called, in decision making, the temporal credit assignment. What does that mean? You're giving credit for this difference between what you expected and what you observed only to the state you visited immediately before and the action you just picked. Of course, that's not particularly smart, because the current state-action pair you are in depends on the many steps you took before; you arrived there somehow. So there are ways of improving on this temporal credit assignment problem, which we will not have the time to discuss, but they make up a chapter in itself in the Sutton and Barto book.
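Since you could implement this directly, here is a minimal sketch in Python of one episode of tabular soft Q-learning. It assumes a Gym-style environment with discrete states and actions; `env.reset`/`env.step` and all variable names are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def soft_max(q_row, eps):
    # eps * log sum_a exp(Q/eps), computed stably (log-sum-exp trick)
    z = q_row / eps
    m = z.max()
    return eps * (m + np.log(np.exp(z - m).sum()))

def soft_policy_sample(q_row, eps, rng):
    # Boltzmann policy: pi(a|s) proportional to exp(Q(s,a)/eps)
    z = q_row / eps
    p = np.exp(z - z.max())
    p /= p.sum()
    return rng.choice(len(q_row), p=p)

def soft_q_episode(env, Q, alpha, gamma, eps, rng):
    # One episode of soft Q-learning: predict, measure, correct.
    s = env.reset()
    done = False
    while not done:
        a = soft_policy_sample(Q[s], eps, rng)
        s_next, r, done = env.step(a)
        td_error = r + gamma * soft_max(Q[s_next], eps) - Q[s, a]
        Q[s, a] += alpha * td_error   # update only the visited (s, a) entry
        s = s_next
    return Q
```

Here `Q` is just the table discussed above: an |S| x |A| array initialized with a guess, the matrix in state times action space.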
So there are algorithms which improve on this simple one, which is called a temporal-difference-zero, TD(0), algorithm, because it looks just one step back. There are ways of extending this idea, giving credit even to states further back in the past, but we do not have time to discuss them, and by the way, they are not so often implemented. Well, there are boundaries, yes, in a sense. But there's a range, a full range that you can use; it's not that you have to set it at one specific value. There's a whole range of things that work, with pros and cons, but we will discuss this in a second. Okay? Any questions? You see how it works? You could implement this; it's very simple.

Optimally in what sense? Optimally in speed of learning. No, it's very difficult to know this; it depends also on the target that you set. I was discussing exactly this problem, so let me talk for a couple of minutes more, and then ask the question again if you're not satisfied. Like we said, the learning rate has to be somehow regulated in order to achieve good performance, and one way of doing this is to reinstate some dependence on time. One good idea is that at the beginning, when you start out of nowhere, you might want to learn quickly, and then, as you get closer to your target, you might want to slow down your learning. This kind of technique is called scheduling, and it can be done in a correct way: there's a range of schedules which gives you the guarantee that this algorithm will converge with probability one to the optimum. How fast it is, that's another issue. Being fast is not always accompanied by being robust; there's always a trade-off. You might have very quick learning which is very fragile, in the sense that if you change your parameters or your scheduling a little bit, it might crash; or you might settle for slower learning with more robustness, which works basically under all conditions.

But then there's also this other thing: if you want convergence with probability one of the whole algorithm, you will also have to send epsilon to zero, because eventually your policy will have to be deterministic; we know that. So there's also a scheduling problem here, in reducing this parameter. Does it sound familiar? Do you remember what interpretation we could give to this epsilon? This is like annealing: you start with a temperature which is high, and then you slowly reduce it if you want to go really down the well. And just as in physics, you can't do that arbitrarily fast, because if you are too fast you risk getting stuck in some nasty local minimum. Here the same happens; only the language is different. In the reinforcement learning community, this temperature is what controls the exploration. Exploration means that you're not acting as greedily as you would on the basis of your current estimate. If at some time t you send epsilon to zero, the softmax becomes a strict max and your policy becomes deterministic, based on that estimate. But this is a very strong assumption, because you're saying that the approximation you constructed from samples is so good that it actually captures the right behavior. It is a sample: how much can you trust it? You have to allow some room for exploration, for the possibility that, given your finite experience, there is something that you just haven't seen, out of bad luck.
There are some experiences that you cannot yet account for. Keeping this exploration parameter finite and sending it down very slowly is the kind of thing you want to do. So, two things: in order to get true optimality, you need to scale the learning rates down to zero, so slow down the learning, and also reduce the exploration. It's been quite a challenge for a long time to understand what the true mathematical ingredients are and how you should act on these things. Be advised that in most algorithms out there, there will be some fixed learning rate and some fixed exploration rate, and people run many, many trials, exploring the space of parameters and trying to find the sweet spots empirically. So these ideas are more important from the theoretical standpoint. In practice, the ability to find good schedules and so on is, on one side, a little bit of craftsmanship — some people are very good at shaping these things, some others not — and on the other side, in real reinforcement learning tasks you almost never care about absolute optimality; you care about something which is approximately optimal. So there's no need to push too hard on these scheduling issues, except for conceptual and theoretical questions.

No. No, because there is still randomness. The only situation where you can go really fast is when your model is deterministic and your rewards are deterministic; then every time you visit a state you know for sure what the transition is. That's the randomness inherited from s_{t+1}, that's what you're saying. Yes, I'm not sure I get the point, because here you have to sample from this anyway, so there is randomness. Again, the only case where there is no randomness is when the rewards are deterministic and do not depend on s_{t+1}, and the dynamics producing s_{t+1} is deterministic itself — and if the dynamics is deterministic, the reward is effectively just a function of (s, a), so you don't mind. But this is a very, very narrow application; there's not much to do in that case, because if you pick an action you are forced into one particular state. In that case you can learn much faster, but it's not typical.

Okay, so I hope I covered all the conceptual issues. Let's move to a little bit of calculation, to show why an argument like this works. Intuitively you sort of got the idea, but now we will do a very simple calculation which actually shows how to schedule these learning rates. We will not have time to discuss the scheduling of exploration rates, but we can discuss learning rates a little. So I will erase the blackboard and rephrase this problem in a much simpler version; I hope you recognize the similarities immediately. The conceptual problem boils down to understanding a much simpler one, the problem of stochastic root finding. This is a part of mathematics which dates back to the 1950s. The major contributors were Robbins and Monro (1951), and then Blum, who treated the multi-dimensional version of the problem. We will discuss just the one-dimensional case, because it's rich enough, and extending it to the multi-dimensional problem is just a technicality — though of course, for value learning and Q-learning, the multi-dimensional result is the relevant one. So, as I drew for you before in anticipation, let's suppose we have a function like this, and let's suppose that within a certain interval this function is increasing. This is to guarantee that there is a single zero.
Of course, it could just as well be decreasing; the derivative would be negative then. This function has a zero, xi, and we want to find it: we want to find the xi which solves this equation. But there's a but: we do not have access to f. We just have access to some random samples of this function, which we will call phi. And the property is that the expected value of any of these random variables, which we can see as a function of the position, is exactly f. So the algorithm must rely only on these observations: if you are at x, you will measure some phi which might be here; if you try again at the same place, you will measure another one, and so on and so forth.

Let's first try to assess the difficulty by considering something which is a very bad choice. A very bad choice would be: suppose I scan my x space, and at each point I sample, I don't know, 1,000 values, and I come up with some empirical average of this function. You see what I mean? And then you search for a zero of the empirical average. Is it a good idea? Well, that would be fine if you knew how the errors are distributed, but you don't know anything about them; otherwise you would have a model. Exactly. The trick is not to try to compute this average, because the actual problem you run into is this: if you did it like this, you would end up with an empirical curve — let me draw it in another color — something like this. And then, when you're close to the zero, there would be noise, and it would be very hard to settle there. So you need few measurements far away, but many more measurements here, near the zero. Yes? Let's assume you have an error only on the y-axis, just for simplicity. If you added a further error on the x-axis, that would be qualitatively similar to having imperfect observations of your states, and that would complicate things; for the sake of simplicity, let's keep it like this.

So this is not a good idea. The good idea is to start out, make some steps, and then refine and refine and refine. The algorithm that does this job is very simple, and I will write it down so we can have a look at it. Start with some x_0, chosen within your domain of interest, within the interval. And then you do this: the next guess is the previous one minus alpha_n times phi, where alpha_n is again the learning rate, which in general depends on the step you're taking. How does this work? Well, suppose there were no error, so you knew f exactly. Then you know that if f is positive, you're on the right side of your zero, and this step decreases x; if f is negative, you're on the left side, and this increases x. The signs work out, of course, because we chose the function increasing. Now, we don't know f for sure. If the error is very large, it might well happen that you fall on the wrong side of zero, in which case the algorithm tells you: go the other way. So strictly speaking, the algorithm sometimes sends you away from the zero, if the error is large enough. But then, if it sends you away, you will sort of come back: you keep measuring, you add knowledge, eventually the errors average out locally, and the drift sends you back again. Does it always happen? No.
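Here is the same iteration as a few lines of Python, with an illustrative noisy function; the linear f and the Gaussian noise are just an example for the sketch, not from the lecture:

```python
import numpy as np

def stochastic_root_find(sample_phi, x0, n_steps, beta=0.7):
    # Robbins-Monro iteration: x_{n+1} = x_n - alpha_n * phi(x_n),
    # with the schedule alpha_n = n**(-beta), beta in (1/2, 1].
    x = x0
    for n in range(1, n_steps + 1):
        x = x - n ** (-beta) * sample_phi(x)
    return x

# Example: f(x) = x - 2 observed through zero-mean Gaussian noise; root xi = 2.
rng = np.random.default_rng(0)
phi = lambda x: (x - 2.0) + rng.normal(scale=1.0)
print(stochastic_root_find(phi, x0=10.0, n_steps=100_000))  # close to 2.0
```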
It happens only if you choose this sequence of learning rates properly. In general, if alpha stays finite, the only guarantee you have is that you will approach some neighborhood of the zero, but you will not be able to shrink it to zero. The size of this neighborhood gets smaller as alpha gets smaller, and the time needed to get there gets longer as alpha gets smaller. So, in order to combine the two things — arrive near your zero in finite time and with ever-increasing accuracy — these learning rates will have to be scaled down. How to do it is encoded in those two papers, for the one-dimensional and multi-dimensional cases, and this is pretty much the basis of all model-free reinforcement learning algorithms based on experience. That's the key result, and we will spend ten minutes deriving it, because it's also very simple to derive.

First, we want to show that, with probability one, x_n tends to xi as n tends to infinity. One simple way to do this, rather than working with the probabilities themselves, is to work with the square distance of our current guess x_n from xi. So let's construct this quantity: the expectation value of the square distance from the target, which we don't know. We know that if this quantity goes to zero as n tends to infinity, we are there with probability one: since the squared distance is positive, its distribution can only concentrate where this quantity is zero. So convergence in mean square gives us convergence in probability. That's pretty simple. And this is an average over what? Over the whole history of the measurements of these phi functions made before we got to step n+1.

The first step is very simple: we just substitute the definition of x_{n+1} here. What does that give? It gives the expectation of (x_n − xi − alpha_n phi(x_n)) squared. Then, again very simple algebra, we expand the square: the square term, minus the double product, 2 alpha_n times the expectation of (x_n − xi) phi(x_n), plus alpha_n squared times the expectation of phi(x_n) squared. Now, why can I pull the alpha_n out of the expectations? Because they form a fixed sequence of numbers which are not random: they are our choice in the algorithm. We fix the sequence of alpha_n at the beginning, and it doesn't depend on the observations; it's something in our hands. It could be, for instance: one, one half, one third, one fourth, and so on.

Then, if we look at these two terms here, they are just consecutive in n. So we can pull this term to the left-hand side and then sum over all n up to a given time; this will be a telescopic sum. Okay? If I do this and sum until time n+1, this is the result of the telescopic sum. Now we have to work a little bit on these two terms. This is the quantity we are looking for, and in the end we would like to show that this object goes to zero in the limit, if certain conditions on these alphas are enforced. So let's go forward. First, let me rewrite this term here, the same equation. Let's forget about this part; I'm going to pull this thing to the other side, with a minus, equal to this. And then let's call the indices j rather than n — I'm changing names — and sum over all j from 1 to n.
This is telescopic, because every term is minus the term below it in the sum, so they all cancel out, and on the right-hand side is just what I wrote here. Now let's work out this term a little bit. There should be an x_j — totally correct, there is an x_j in my notes. Let's first have a look at this average. It's an average over what? Over all measurements that have been done, past and present. But this measurement of phi, which depends on x_j, is totally independent of the past. So I can push the average inside, and this object is exactly E[(x_j − xi) f(x_j)]. Mind that this is true because of this property, and because at every x we make an independent measurement. This is an assumption of the algorithm, which simplifies the derivation a lot, but might not necessarily hold in reinforcement learning applications; that requires some care, and Matteo in his tutorial will tell you how to deal with real tasks where the noises are not independently distributed. Note that they need not be identically distributed; they are only required to be independent at every time.

So this is one thing. Then we would like to get rid of this term somehow, or transform it. One thing we can do is rewrite it in the following way: let's multiply and divide by (x_j − xi). Now have a look at this prefactor. At any point, it is the slope of the chord joining the zero to the point on the curve: the difference in height divided by the distance from the zero. And we can assume that this thing is bounded away from zero: we take the infimum over all possible points and bound this object from below by some constant, which we call G. You will notice that I'm not doing what mathematicians do, listing a series of assumptions and then going on; I prefer to introduce each assumption along the way, when it's needed to make things progress. Here the thing that worries me is this negative term: it's the term that pulls you toward the zero, and I need it to stay strong enough, so it is natural to bound this object from below in this way. Notice that this also means that your function should not become too flat: nowhere in your domain do you have situations like this. You don't like this, because these would be strong sources of error; if you stop here, you would go back and forth a lot of times. So this assumption basically sets a limit on how flat the function can be, and you will see that the flatter it is, the longer it takes to learn. We are asking that; we are saying that our function has this property: the chord slope never gets arbitrarily close to zero. When you say the derivative is positive, this assumption is a bit stronger than that: it tells you that everywhere the slope is larger than some fixed value, basically. And it's a way of getting rid of this term, technically speaking. If the slope merely stays positive but can get arbitrarily close to zero, it's a mess, like I said; you don't want that anywhere. For this algorithm to converge, there must be, everywhere, some finite drift that pushes you in one direction or the other; that's something you intuitively want to have.

Yes, from here to here I just multiplied and divided by this. Ah, from phi to f? Yes, let's do it in a line here. The expectation value of (x_j − xi) phi(x_j) depends on the full history of observations, because x_j is the result of several iterations, right?
So let's split it into the expectation value of (x_j − xi) phi(x_j) conditioned on the last point, and then average over the previous part of the history which brought us there, that is, up to step j−1. Then I can take this factor out of the inner average, and I get the expectation of phi(x_j) given x_j, and this object is f(x_j). You just have to split your history into the last step and all the previous steps, and this becomes exactly what I wrote here. Good.

So one source of trouble, which we wanted to avoid, was the function getting too flat. Another possible source of trouble is when your noise is so large that its variance is infinite. If the variance of your noise is infinite, you will never be able to converge, because the noise will keep sending you very far away from everywhere. So the second thing we're going to ask is that this quantity here is finite: this object is less than or equal to some constant H. This means that no matter what your errors are — they might be small or large depending on the position — they are all bounded in second moment by some quantity. The second moment of phi is just the variance plus the mean squared, and since the function itself is bounded, if one is bounded, so is the other. G is a number and H is a number; these two are numbers. Strictly speaking, G is defined as the infimum, over all your domain, of f(x)/(x − xi), which we ask to be positive. Yes. Good.

Now we have basically arrived, because we can write — I will erase the first line and write down the result of our manipulations here — the following inequality. So this is an inequality, okay, and this is also an inequality. We are bounding this left-hand side from above, because this term is minus something which is larger than the bound, so overall I can write this. Why am I doing this? It should be pretty clear: I want to find an upper bound to this quantity, because if I can show that the right-hand side goes to zero, then this goes to zero too, since it is positive by definition, right? So we are constructing a reasonable upper bound for the quantity on the left. And what is it? It's this object, minus two times G, which is our bound, times the sum over j from 1 to n of alpha_j times this, plus H times the sum over j from 1 to n of alpha_j squared. Sorry, this is a two, sorry. I was having difficulty decoding the question, but it makes sense: yes, the expectation of the square is upper bounded by H, so the variance of the error must be finite. Good question. In general, we don't know them. The actual values of G and H could be used to find the best choice of learning rate, but in general we don't know them, so we have to go with some guesses. That's a very pertinent question: the value of G can actually be computed for the reinforcement learning tasks I was discussing before, from the properties of the transition operator, et cetera, so one can construct bounds for that quantity as well. But in general these are unknown, so we will come out with something which tells you to choose a certain rate given certain values of G and H, and you will have to choose conservatively. But let's get to that at the end. We're almost there. I'm erasing this because it's already contained up there.
So now what we do is extract this quantity and rewrite our inequality as follows: the sum over j from 1 to n of alpha_j times this. I'm pulling this term to the other side and keeping this part of the inequality, right? Let's forget about this for a second; we only care about the fact that it has to be positive, because it is a quadratic quantity. Pulling this to the other side, what we get is that this is bounded by one over 2G times the expectation of (x_1 − xi) squared, plus the other term. Okay.

So now we come to the result, and the result is the following; I'm writing it down and then we will show that it is enough. A sufficient condition for convergence with probability one is that the sum over j from 1 to infinity of the alpha_j must diverge, and the sum over j from 1 to infinity of the alpha_j squared must converge. So the sequence of alpha_j that you choose for your learning rates has to match these two conditions. There clearly exist choices which do this: the alpha_j have to go down, fast enough that the sum of the squares converges, but not so fast that the first sum stops diverging.

Why is this enough for convergence with probability one? The second condition is pretty clear: when you send n to infinity, you want this quantity to stay finite, and this guarantees that this object is finite in the limit. This other quantity is also finite: it's just the distance of the initial point you chose from the zero, and if you are in a finite domain, this is finite — it's a measure of how large your search domain is; at worst it's the diameter of your set, squared. And the sum of the alpha_j you want to diverge. Just a second; the reason is this. If we ensure all this, so that this object is finite, and this sum diverges, then this quantity must go to zero. Reason the other way around: suppose this expectation did not go to zero in the limit, so the algorithm does not converge. Then this object tends to a finite value, you can pull it out of the sum, and the sum of the alpha_j, which by hypothesis is infinite, would make the left-hand side infinite. You would have something infinite less than or equal to something bounded, which is a contradiction. It's straightforward from this inequality: the right-hand side is finite, so this has to go to zero, because the alpha_j alone do not go down fast enough to make the sum converge.

But the more important thing is that the way of achieving this encodes the real needs that any learning algorithm has. This quantity is related to the variance — it's the bound on the variance — and this quantity here is related to the bias: how far off your initial guess was. The best thing you can do is try to balance the two: you want to learn fast, so you want to rapidly wash out the bias of your initial condition, and you also want to control the variance. This is accomplished by these two conditions. This one, the divergence, is the one that tells you your learning rates cannot decay too fast.
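Putting the blackboard steps together, the chain of bounds reads (a reconstruction):

```latex
% With G = inf_x f(x)/(x - xi) > 0 and E[phi(x)^2] <= H:
\mathbb{E}\big[(x_{n+1}-\xi)^2\big] \;\le\; \mathbb{E}\big[(x_1-\xi)^2\big]
 \;-\; 2G \sum_{j=1}^{n} \alpha_j\, \mathbb{E}\big[(x_j-\xi)^2\big]
 \;+\; H \sum_{j=1}^{n} \alpha_j^2 ,

% and, since the left-hand side is nonnegative,
\sum_{j=1}^{n} \alpha_j\, \mathbb{E}\big[(x_j-\xi)^2\big]
 \;\le\; \frac{1}{2G}\Big( \mathbb{E}\big[(x_1-\xi)^2\big] + H \sum_{j=1}^{n} \alpha_j^2 \Big).

% Sufficient (Robbins-Monro) conditions for convergence with probability one:
\sum_{j=1}^{\infty} \alpha_j = \infty, \qquad \sum_{j=1}^{\infty} \alpha_j^2 < \infty .
```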
If you let them go down too fast, you will not be able to cancel your initial bias. On the other hand, this second condition tells you your learning rates should not decay too slowly either: if they stay too large, you will not be able to cancel the variance, that is, to average out the errors. So you see, there's a whole range of schedules for your learning rates that lets you get to optimality with probability one, and of course there are many ways of choosing them. In practice, any choice in which the alpha_j go like j to the minus beta, with an exponent beta between one half — strictly excluded — and one, works. If you move closer to one of the two boundaries, you either favor learning rapidly and cancelling the bias, or you favor moving more slowly while controlling the variance very well. There are also different degrees of robustness, which we will not have a chance to explore, but if you move toward the upper part of the spectrum — sorry, there are too many alphas here, let's call the exponent beta — if you move closer to beta equal to one, your learning is slower but much more robust: you don't need very, very precise assumptions.

It's really late; we've just got ten minutes left, so I'm afraid I'm not able to discuss all the things I wanted to. Let's just finish quickly with this. What we did provides the theoretical ground for all reinforcement learning algorithms that work on this notion of stochastic approximation. There have been several improvements and variations on the theme. One limit of all the things I've discussed today — and there are also equivalent formulations for the value function — is the fact that you have to construct either vectors over states, for the value function, or state-action tables, for the quality function, and keep them in memory. Clearly this cannot work for things like AlphaGo, because you simply cannot store all possible configurations of your system; perhaps not even for chess can you do that. You can use this tabular description only for very, very small systems. So the challenge now becomes: can one extend these ideas to the situation where you don't have an explicit representation of your state space? What's the idea behind this? You will have to approximate your value function with some lower-dimensional function. I hope I've had the time today to explain the theory of this, so that tomorrow, when you have the first tutorial by Alberti, you will know how it works; and then there will be another tutorial by Alberti the day after, again on deep Q-learning. But let me just tell you how it works in a few words now, and then explain this notion of function approximation in a bit more detail in the last lecture. The idea is that if you cannot handle your full, very big state space, you will have to settle for some function which approximates it. Rather than seeking the solution as a state-action table itself, which has too many entries, you write down this state-action function as a function of states and actions with some weights.
So you parameterize your state-action function, and your search is no longer conducted on the state-action space but on these parameters, which by assumption you choose to live in a space of much smaller dimensionality. Then the question is: how do I represent my quality function? What is a good choice? For a long time, people did this in a very artisanal way, I would say. For instance, one thing you could say is: okay, maybe many states look similar, so I can group them into a single coarse-grained state. This is also something you would do in physics, right? If you cannot describe many, many small degrees of freedom, you can say: okay, this is just one large degree of freedom, and I coarse-grain my description of the model; it's very similar in spirit. But then questions arise. How do I select the grouping? I should do this problem by problem: in a certain problem, grouping configurations together might be intuitive; in others, it might be less obvious. How do I group Go configurations together? How do I measure the neighborhood of some coarse grain? I would have to do this in different ways depending on the problem, so it's a highly non-universal approach.

One thing that has been done extensively, because it allows very good theoretical control, is to expand this function in a linear basis, with weights, and then you come up with algorithms which are pretty much the same as the ones I was discussing here. These basis functions are called features in the jargon. It's just like saying: I have a very large space, but in fact my quality function depends only on a subset of vectors in this space, and I can use them to reduce the dimension. But the problem stays there again, because a choice that is very good for some problems might not be exportable to other ones; every time, you would have to be involved in the design of a new feature set.

And then it came as a realization that there is one way of approaching this problem systematically. What you are looking for is a universal function approximator: some way of parameterizing that can describe every possible function. It turns out that one possible way of doing this is through neural networks — feed-forward neural networks with just a single hidden layer; I'm assuming you're familiar with the terminology, because we had a story on this. If you have a neural network with a single hidden layer, this is enough to approximate any function whatsoever with arbitrary precision. It might take a very large number of nodes, but it is able to do the job; this is a mathematical statement. If you add more layers, it turns out — and at this moment it's mostly an empirical observation — that you can encode more functions with the same number of nodes, if you have a deep architecture, that is, several layers connected in a feed-forward way. Combining the idea of neural networks with the parameterization of this Q function leads you to the current generation of algorithms. Tomorrow you will have a practical demonstration of what they are and how they work, and then, if needed, we can catch up next Thursday on more of the theory and the ideas behind it. Questions? Enjoy your lunch.
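To make the function-approximation idea concrete, here is a minimal sketch of the linear-features version, assuming a designer-chosen feature map `phi(s, a)`; all names are illustrative assumptions. Deep Q-learning replaces the linear map `w @ phi(s, a)` with a neural network, and the gradient in the last line changes accordingly:

```python
import numpy as np

def q_hat(w, phi, s, a):
    # Parameterized quality function: Q(s, a; w) = w . phi(s, a)
    return w @ phi(s, a)

def semi_gradient_soft_q_step(w, phi, s, a, r, s_next, actions,
                              alpha, gamma, eps):
    # Soft value of the next state: eps * log sum_a' exp(Q(s', a')/eps)
    q_next = np.array([q_hat(w, phi, s_next, ap) for ap in actions])
    z = q_next / eps
    soft_v = eps * (z.max() + np.log(np.exp(z - z.max()).sum()))
    # Temporal difference error, exactly as in the tabular case
    td_error = r + gamma * soft_v - q_hat(w, phi, s, a)
    # Semi-gradient update: for a linear model, grad_w Q(s, a; w) = phi(s, a)
    return w + alpha * td_error * phi(s, a)
```

The update now touches all weights at once rather than a single table entry, which is exactly what lets the agent generalize across states it has never visited.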