Very good, so welcome back. For today's lecture, the plan is to start moving away from the setting of Markov decision processes in which we know everything about our system: we know the transition probabilities for the states, we know how the rewards are produced depending on states and actions, and therefore the problem was, as I repeated several times, essentially a problem of computation, a problem of planning. Given a model which we assume to be exact, we want to compute the best actions to take for a given objective, where the objective is constructed in terms of the cumulative discounted sum of rewards. You might remember that quite a while ago I was discussing a diagram where we put on the horizontal axis our knowledge of the model, that is, the a priori knowledge that we have, and on the other axis the observability. Markov decision processes sit up there: we know exactly what the model of our system is, we can perfectly observe the states, and therefore, at every step, without ever interacting with the environment, we can just plan ahead what to do. The goal for the following lectures is to move down along this line: we still keep as known everything about the model, namely transition probabilities and the structure of rewards, but we now allow for the possibility that we do not have direct access to the full space of states. This will lead us, in a couple of steps, to formulate a more general setting, that of partially observable Markov decision processes, or POMDPs for short, which extends the notion of a Markov decision process and combines ideas from optimal decision making with ideas from Bayesian inference. That is where we want to go, but it will take us a few steps to get there, and the first step is to do something simpler. One thing at a time. First of all, I would like to motivate the need to move in this direction by looking once more at our now familiar examples of reinforcement learning problems and seeing how they change if we introduce this notion of partial observability. First example: Bernoulli bandits. To fix ideas, let's consider the situation where we have K = 2 arms, which basically corresponds to two coins. Just to remind you what the problem of Bernoulli bandits is: you have two coins, and these two coins have different biases, so the probability of seeing head on each of the two coins is different. The decision making problem is to try to always play the best coin. Now, how do we cast this as an MDP? As usual, we have to define states, actions, transition probabilities and rewards. So what is the state of this system? That is the first and foremost question. The state of this system is the pair of biases of the two coins: this is what defines the environment. In this simple setting the state is a pair of numbers, mu one and mu two, which are the probabilities of seeing head. More formally, mu_i is the probability that, when I toss coin i, the observation y_i is head. The two coins are assumed to be independent, as always. These numbers live in the unit square, so graphically the state space is the square [0, 1] x [0, 1].
So if both coins are fair, you are just sitting here in the middle: the point (1/2, 1/2) means both coins are fair. Of course this is not particularly interesting as a decision making problem, because whatever you do you win half of the time on average, so there is no real choice. The interesting things happen when there is a choice. If you draw the diagonal, on the diagonal sit all the situations in which the two coins have the same probability of winning, and for all of them the decision making is trivial: there is nothing to decide. So this is the state space. Now take one of these states, any one of them, and ask: what kind of actions can you take from that state? You can either decide to toss coin number one or to toss coin number two; these are the decisions you can make. And what is the result of these decisions? First and foremost, by the way we defined the system, whatever decision you take, you will go back to the initial state. What can happen, though, is that you end up there via two different possible outcomes. Let me draw two arrows: if I toss coin number two, then with probability mu two I get a reward of one, and with probability one minus mu two I get a reward of zero; I win or I lose. The same happens above for coin number one: with probability mu one you win, and with probability one minus mu one you lose. But you always go back to the previous state, because when you toss a coin you keep using the same pair of coins; the coins are not changing. In the state space this means that these transitions always bring you back to the same point: the probability of landing in a new state s', given the previous state s and any action a, is just one if s' = s and zero otherwise. So in this particular setting, which is called simple stochastic bandits, the states do not change as a result of the actions. That is why the problem of multi-armed bandits is often called "baby reinforcement learning": there is no notion of states changing or of long-term goals. It is a very, very simplified notion of decision making, which nonetheless has several interesting aspects in itself. What are the rewards? We just said it: the expected reward for taking action a from state s is just mu_a, because the state is the pair (mu one, mu two) that you see up here, and if I take action a, I win a reward of one with probability mu_a. Is it clear so far? Very good. So what is the Bellman equation for this problem? As you can clearly see, we can immediately set gamma equal to zero; in fact it does not really matter whether we set gamma to zero or to any other value, because the process repeats itself identically at every step, so nothing really changes from one time to the next. You can check this by yourself, but let me go one step at a time: let's set gamma equal to zero first and see what happens. What does the Bellman equation become? Let me rewrite it in full.
So the optimal value of any state is the maximum over all possible actions of the sum over all new states, weighted by the transition probability, of the reward incurred in the transition plus gamma times the optimal value of the new state: V*(s) = max_a sum_{s'} p(s' | s, a) [ r(s, a, s') + gamma V*(s') ]. What does this become here? Remember, we effectively have a single state, so we can fix our attention on this s only, and s' is necessarily equal to s. So for this problem I start from my state s, which is really just a labelling parameter, and I take the maximum over the two possible actions. If I take action one, the transition probability is just one, as we defined above, so the sum is trivial and I am left with r(s, a); and since gamma equals zero, I can ignore the rest. So this is a maximum over two possibilities; let me write it explicitly: the average reward if I take action one, or the average reward if I take action two. I just have to take the maximum between these two objects, and the first is just mu one while the second is just mu two: V*(s) = max(mu_1, mu_2). So if I frame the bandit problem as a Markov decision process, the decision making is trivial for gamma equal to zero, because I just have to pick the coin with the largest bias. Of course that is obvious: if I have two coins and I tell you that the coin in my left hand has a 60% probability of winning and the other one has 50%, and this is true, the model is correct, I am not lying to you, then you would play the coin in my left hand, because it gives you a higher probability of winning. That is it; there is really nothing subtle in this. This is to say that, from the viewpoint of Markov decision processes, bandits are trivial decision problems, because there is no need to look into the future. Now, you might ask what happens for gamma strictly larger than zero. Let's have a quick look. The calculation is quite similar; let me keep the explicit notation in terms of the two actions, so everything is visible: V*(s) = max( mu_1 + gamma V*(s), mu_2 + gamma V*(s) ). Which of the two entries under the maximum is larger determines the best action. But now you notice that the term gamma V*(s) on the right-hand side is the same in both entries of the maximum, so it makes no difference and we can pull it out of the maximum: if you take the maximum of two objects and add the same quantity to each of them, you are just shifting everything upwards, and only the difference matters. So it is not difficult to see that V*(s) = max(mu_1, mu_2) + gamma V*(s). Now, one last very easy step: move the term gamma V*(s) to the left-hand side, and we are left with the general result that the optimal value is V* = max(mu_1, mu_2) / (1 - gamma). So you see the decision problem is the same: only the prefactor in front of the value changes. It does not matter whether gamma is zero or very close to one; it changes the numerical value of your objective function, but it does not change the decision. You will always pick the best coin.
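To make this concrete, here is a minimal numerical sketch (with arbitrary example biases mu1 = 0.6, mu2 = 0.5) checking that value iteration on this one-state MDP converges to max(mu1, mu2) / (1 - gamma):

```python
# One-state, two-action Bernoulli bandit viewed as an MDP.
# Value iteration reduces to v <- max(mu1 + gamma*v, mu2 + gamma*v),
# whose fixed point is max(mu1, mu2) / (1 - gamma).
mu = [0.6, 0.5]      # example biases, assumed known to the planner
gamma = 0.9

v = 0.0
for _ in range(1000):                      # iterate the Bellman operator
    v = max(mu[0] + gamma * v, mu[1] + gamma * v)

print(v, max(mu) / (1 - gamma))            # both are approximately 6.0; best action: arm 1
```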
So first of all, this was a completely trivial example of solving the Bellman equation, which of course poses no difficulty. But now we ask: what happens in reality? When we really face this decision problem, we are typically not told which coin is better; if I want to challenge you, I do not tell you that this coin is better than that one. You have to discover it by yourself. And how can you do that? Just by playing. This is the essence of reinforcement learning: think about what the outcomes of playing would be. Notice that in this situation, when I tell you the coins are Bernoulli, you already know a lot about the problem; you know basically everything, and the only thing you do not know is where you are in this state space. So what you want to discover while playing is where your point is, where your two coins are located. Because if they are up here, you would play coin number two as the optimal decision, since this upper triangle is the region where coin number two is better than coin number one, and vice versa below the diagonal. The point is that at the beginning you do not know where you are, because you have not played any coin yet. So unless some side information comes to you, you do not know what to do at first, and you will probably play randomly. As you play, you collect information about the system, and the point is now to understand how to exploit that information in the best possible way. You have to realize that at this stage this is still a planning problem; it is just a problem of planning under uncertainty. In your mind, as a decision maker, you can construct all possible sequences of events that could happen in the future. You can say: suppose I choose to pull coin one and it gives me head; then pull one, head; pull two, tail; pull two, head. I can imagine, in my mind, the full tree of future events, including this uncertainty, and then I can try to decide beforehand what to do under every possible series of events. It is a little bit mind-blowing that you can actually perform decision making in such situations, because there is clearly an exponentially growing tree of future possibilities, even larger than the one you deal with in ordinary reinforcement learning. But the key essence of partially observable Markov decision processes is how to construct algorithms that allow you to plan even in the presence of uncertainty, provided you have an accurate model of what is going to happen in your hands. So, is it clear at this stage what the difference between an MDP and a partially observable system is for bandits? It is exactly the situation in which you are told what the model is, in the sense that you are told that these are Bernoulli coins, which gives you a lot of information: if I tell you the coins are Bernoulli, you know the probability distribution with which you will see heads or tails. It is Bernoulli, but you do not know the parameters of that distribution. So you want to control a system in which you do not know certain parameters that govern it, but you do know the general rule. That is our first example. Now we will see another two examples, to look at the problem from different angles. This first part is very conceptual, so it does not pose any technical difficulty, but it is important that you try to understand what the setting is and what the questions are.
Our second example will be grid world. You have also seen this in the tutorial, so you should have an intuitive understanding of how the system works. This is our grid, and there are certain cells on the grid which are not accessible; this is the usual definition of grid world. You have some rewards placed somewhere, here or here. And then, as usual for an MDP, you define your transition probabilities, and you define where your rewards are and how large they are. Remember that the transition probabilities basically describe the kinds of moves you can or cannot make starting from a certain position, the allowable transitions and with what probability they occur; and the rewards describe where the goals are, and maybe there are other penalties around. You can construct this system as you wish. You have seen that, when you have knowledge of all these things, you can use value iteration to produce an optimal solution of this problem, which tells you, from each cell, the best action to take. You remember we had a graph like this, with the optimal actions to take from any given point, and they depend on gamma, and so on. Now, what is the equivalent problem with partial observability? The problem would be, for instance, that at any given time the decision maker, our small robot moving around, say it is actually occupying this tile here, does not know its position with that precision. For instance, the robot's sensors, or the GPS, or whatever localization system you have, only tell you that you are inside an area of a certain radius. The typical situation is that localization has some error, so your position in space is not known perfectly. But you know what kind of errors you make. Once again, knowledge of the model means that you know, for instance, if you are in a certain state s, what the probability of making an observation y is: f(y | s, a) = probability that the sensors measure position y when the real position is s. This is what is known as an observation model. In general, this observation model might also depend on the action: for instance, your GPS might work better when you are standing still than when you are moving. So the quality of the observations, the size of the localization errors, might depend on the action you are taking. It might also depend on the state you start from or on the state you end up in, so it could also be written as a function of s'; these are matters of convention. What is important here is that this observation model is also known to the agent: the agent has a model of what the outcomes of the observations will be, even though they are inaccurate. In the previous situation, the one of the bandits, observations and rewards were essentially the same thing, so the two notions were a little conflated, but in general these are two different things: observations are contextual information, and rewards are rewards. In the bandit problem the two just happen to come together very closely.
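As an illustration (a sketch with made-up numbers, not the lecture's actual model), an observation model for the robot could be a Gaussian centered on the true position, whose spread depends on the last action, e.g. a larger localization error while moving:

```python
import numpy as np

def observation_density(y, s, a, sigma_still=0.1, sigma_moving=0.5):
    """f(y | s, a): density of measuring position y when the true position is s.

    Hypothetical model: isotropic Gaussian noise around the true position,
    with a larger standard deviation if the last action was a move.
    """
    sigma = sigma_still if a == "stay" else sigma_moving
    y, s = np.asarray(y, float), np.asarray(s, float)
    d = y.size
    sq_dist = np.sum((y - s) ** 2)
    norm = (2 * np.pi * sigma ** 2) ** (d / 2)
    return np.exp(-sq_dist / (2 * sigma ** 2)) / norm

# Example: likelihood of reading (2.3, 1.1) when the robot is truly at (2, 1).
print(observation_density([2.3, 1.1], [2.0, 1.0], a="move"))
```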
So the question is: even if your robot has a crappy GPS, can it still be controlled in the best possible way, given that you know what kind of observation errors it makes? Once more, you know what outcomes will be the consequence of your actions, you have a map of the environment, so you know where the rewards and the stumbling blocks are, and you know in detail how your GPS behaves, so you also know this observation model. Can you plan in advance, given all these uncertainties, the best way of deciding? That is the question of partial observability. Is this clear so far? Any doubts? Yes?

"I'm sorry, I have a question. So basically f is a density over the state space?"

In this case, yes: here states and observations are objects of the same kind, if you wish.

"And why do we have the action in the density, in f?"

Just because, as I said, the type of observation you make might depend on the action you take. The example I was giving is: suppose your robot stands still; then maybe the precision is better, the variance of the measured positions is smaller. This is something that happens typically with GPS: if you stop, you get a better signal and your precision is higher, while if you use your GPS while running, the precision may be lower. So it is just that the quality of the observations may also depend on the actions; sometimes it does, sometimes it does not, it is model and situation dependent. But in general you can include this dependency in your observation model. Is that okay?

"So we think of this reading as a function taking a given value depending on the last action we took; in the case you describe, the robot making a position error while moving, there is a notion of time involved. I thought this function had the same value for any given state and action combination."

Yes, and keep in mind that this is a random object: the observations themselves can be noisy. The same object for the bandits, let me write it here for clarity, would be the probability of making an observation y given a state and an action. The observations there are just zeros or ones, head or tail, and they are Bernoulli. Remember the state is (mu one, mu two), the two biases. If I take action a, meaning I flip coin a, which can be either one or two, then f(y | s, a) = mu_a^y (1 - mu_a)^(1 - y). This is the Bernoulli distribution: if y is one, the probability is mu_a, and if y is zero, the probability is one minus mu_a. So this is an explicit form of your model of the environment: you know what form it has, but in general you do not know the values of the mu's. And in grid world it is the same: you know the form of the observation model, but you do not know where you actually are, or some of the parameters appearing in this distribution.

"Okay, thanks." Sure. Very good. So for a third example, we are going to look again at our cart pole problem. You remember: it is a cart on a finite domain with a pole attached on top of it by a hinge, and the goal is to balance the pole.
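A tiny sketch of this bandit observation model (example biases only): sampling an outcome and evaluating its likelihood under the Bernoulli form that the agent knows, with parameters it does not:

```python
import random

mu = {1: 0.6, 2: 0.5}   # hidden state (mu1, mu2): the agent knows the form, not these values

def observe(a):
    """Toss coin a; return y = 1 (head) or 0 (tail)."""
    return 1 if random.random() < mu[a] else 0

def likelihood(y, a, mu_a):
    """f(y | s, a) = mu_a^y * (1 - mu_a)^(1 - y), the Bernoulli observation model."""
    return mu_a ** y * (1 - mu_a) ** (1 - y)

y = observe(a=1)
print(y, likelihood(y, a=1, mu_a=0.6))
```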
So you want to apply forces to the cart in order to keep the pole as close as possible to the vertical position. As an MDP, this is a control problem. Let's assume we know the model, so we know the degrees of freedom that describe the system. Let's make the simplest possible assumption about the mechanics: the state space is made of the horizontal position and momentum of the cart and of the angular coordinate and momentum of the pole. This is my phase space for the dynamics of the system, which is made of two rigid objects connected by a hinge, in one dimension. Then there are the equations of motion for these degrees of freedom; let me call them z (q would be the traditional name, but it is so awkward here that I immediately want to change it): z-dot = f(z, u), where u are the controls; this is the deterministic version, and you can add noise to the system if you wish. So what does it mean to know the model? It means you know exactly what this f is: you know you are applying Newton's laws, so accelerations and angular accelerations are connected to masses and moments of inertia, and you know all of this. You can write down the dynamics of the system exactly. If you cast it like this, it is a problem of optimal control in engineering: you try to solve the Bellman equation, and you may want to do so in continuous time and space. I did not cover it, but there exists a formulation of the Bellman equation for that case too, as you can probably imagine. Now, of course, in such a system you are confronted with a first difficulty, namely that your state space is continuous, and here four-dimensional. So when you want to solve the Bellman equation you have a problem, and what you would naively do is use some discretization, which amounts to resolving your system only on certain spatial and temporal scales. The sheer fact of having a continuous state space means that you are very unlikely to be able to measure all these coordinates with infinite precision. So this simple fact, the limited resolution, already implies partial observability. This might be a very severe problem or not, depending on what you are actually able to observe. In this case, if we say we can measure positions and velocities every tenth of a second, maybe that is enough for my system to be controlled properly, but maybe it is not: if the system is too fast, I will not be able to react appropriately. So this is another example of a situation where there are limits on your observation capabilities. Or you might also consider, as before, the fact that you do not actually measure x, but some noisy version of it. This noise comes from the instrument: it is noise in the measurement device; when you take a photo, there is some blurring of the image, so you do not know exactly where the point is. These kinds of things are quite normal. What we want to do here is develop a framework that can deal with situations with small noise, with large noise, and even with extreme situations where some degrees of freedom are not accessible at all.
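As a sketch of what "knowing f" means here, below are the standard frictionless cart-pole equations of motion (the form used in the classic benchmark; masses, lengths and the applied force are example values, not from the lecture), integrated with a simple Euler step:

```python
import numpy as np

# Example physical parameters (assumed values)
G, M_CART, M_POLE, L = 9.8, 1.0, 0.1, 0.5      # gravity, cart mass, pole mass, half pole length
DT = 0.02                                       # integration time step

def f(z, u):
    """z-dot = f(z, u) for the frictionless cart pole.
    z = (x, x_dot, theta, theta_dot); u = horizontal force applied to the cart."""
    x, x_dot, th, th_dot = z
    total_m = M_CART + M_POLE
    tmp = (u + M_POLE * L * th_dot**2 * np.sin(th)) / total_m
    th_acc = (G * np.sin(th) - np.cos(th) * tmp) / (
        L * (4.0 / 3.0 - M_POLE * np.cos(th)**2 / total_m))
    x_acc = tmp - M_POLE * L * th_acc * np.cos(th) / total_m
    return np.array([x_dot, x_acc, th_dot, th_acc])

def step(z, u):
    """One Euler integration step of the deterministic dynamics."""
    return z + DT * f(np.asarray(z, float), u)

z = np.array([0.0, 0.0, 0.05, 0.0])             # slightly tilted pole
print(step(z, u=10.0))
```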
So maybe in a system with many, many degrees of freedom, a robot with hundreds of degrees of freedom, we want to observe only a small number of them. Can we still control our system? Can we still obtain the best possible performance given the partial observability that we have at hand? This is the question that we are going to address. Are there any questions so far? Is the conceptual setting clear? All right, if that is the case, I think it is a good point to stop, take a break, and start again at 10 sharp. Okay, see you soon.

Okay, so in this second part we will look at a very straightforward way of approaching this problem of partial observability. It will not be the end point of our journey, but it is useful because it introduces some ideas that will prove important later on. A very straightforward idea for dealing with partial observability is to resort to ideas and concepts from function approximation. To fix ideas, let's consider the grid world situation. Say we have a large grid world environment with many, many tiles; I am not drawing them all, just a few in the corner to give you an idea. I have in mind something with a very fine granularity, so you potentially have a very large number of states in the system. In this environment you have some obstacles as usual, some large, some small, and then there is the structure of your rewards, say the rewards are located somewhere around here, and as usual the task is, for instance, to reach the reward in the shortest possible time if it is a goal state. Now, one of the basic ideas that combines the notions of partial observability and function approximation is the realization that you do not actually need this huge number of states to describe the system. When you build your robot, it does not necessarily need to resolve distances of, say, one millimeter: if your robot has the size of a usual vacuum cleaner robot, 30 centimeters, its ability to localize and move need not be at the scale of a micron. Even though it of course makes sense in principle to describe space on the finest possible grid, it may simply not be necessary. So you would like to combine these two ideas: maybe I do not need such a fine description of my state space, and maybe I am not even able to observe it that precisely. These two things, partial observability and approximation, come together. And from this comes the idea: in a system like this, maybe I do not need to define states as small as this; maybe I just need something like a grid which is very coarse here, because I expect, for instance, that all good actions there will be the same. Since the reward is at the bottom, maybe all actions in that region point to one side or downwards, so I can lump those cells together into a single superstate, if you wish. Then maybe when I get close to the obstacles I need to refine my grid, because there it matters what I do, and then I can make it coarser again over here. The general message is that, except where it is strictly necessary, you do not need a very fine observation of your state, of your contextual information; a sketch of this kind of aggregation follows below.
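A minimal sketch of this kind of state aggregation (grid size and block size are arbitrary examples): fine grid cells are lumped into coarse superstates by integer division, except near obstacles, where the fine resolution is kept:

```python
def superstate(cell, obstacles, block=4):
    """Map a fine grid cell (row, col) to a coarse superstate id.

    Cells adjacent to an obstacle keep their own (fine) identity, because
    there the choice of action really depends on the exact position;
    everywhere else, block x block cells are lumped together.
    """
    r, c = cell
    near_obstacle = any(abs(r - orow) <= 1 and abs(c - ocol) <= 1
                        for orow, ocol in obstacles)
    if near_obstacle:
        return ("fine", r, c)
    return ("coarse", r // block, c // block)

obstacles = {(5, 5), (5, 6)}
print(superstate((0, 1), obstacles))   # ('coarse', 0, 0): lumped with its neighbours
print(superstate((5, 4), obstacles))   # ('fine', 5, 4): next to an obstacle, kept fine
```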
And maybe you do not even need that fine resolution in the computation stage: you might simplify the solution of your optimality equations by a suitable coarse-graining of your state space. This is one motivating example for why you might want to modify the structure of your state space in order to account for limited perception, for limited precision in the description of your environment. As a matter of fact, this is what is called tiling: you use a set of tiles to describe your state space. It also goes under the name of state aggregation: you lump many states together and call them a single superstate. Of course, you have to realize from the beginning that you are doing a lot of violence to your system, because even if your system was Markovian, in the sense that the transition probabilities from one true state to the next obey a Markov process, the transitions across macro-states need not be. There are many ways of transitioning from this tile to another one, through different points on the boundary, and when you lump them all together you are violating the Markov property: the probability of transitioning from tile A to tile B no longer depends only on the fact that you were in tile A at the previous step; it also depends on where, microscopically, inside tile A you were, and this is information that has been lost. So this combination of function approximation and partial observability is something one has to be careful with, because if it is applied too naively, it breaks Markovianity, and then you are using a model of the environment which is not as good as the original one. This is just a word of caution: all these operations, which look very natural, still have the potential to do great damage. Of course, when you go to the continuous world, to robotics, where states and actions are continuous, this approximation becomes a necessity, and maybe you do not want to do something as crude as tiling, but something smarter. For instance, one possibility is the following: suppose you have motion along a line. Tiling in one dimension would mean partitioning your space into blocks: when your robot is here, you say it belongs to superstate i; when it crosses this boundary, you say it belongs to superstate i+1, and so on, even though the actual state is a continuous variable, the horizontal position. Rather than doing that, you may want to do something better, which is to soften these objects a little. Instead of boxes, where the indicator function phi_i(x) is one if x is in box i and zero otherwise, a collection of box functions one beside the other, suppose you soften these boxes and turn them into Gaussians, a different set of indicators which may overlap a little. This makes the transitions softer, gentler, more graceful as an approximation method. In this case your functions would not be box functions but smooth functions of the position, for instance small Gaussian bumps of a certain width lambda: phi_i(x) = exp( -(x - x_i)^2 / lambda^2 ), centered at some set of prescribed locations x_i.
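A small sketch of these softened tiles (widths and centers are example values): Gaussian radial basis features over a one-dimensional position, compared with the hard box indicators:

```python
import numpy as np

centers = np.linspace(0.0, 10.0, 11)      # prescribed locations x_i of the bumps
lam = 1.0                                  # width of each Gaussian bump

def box_features(x):
    """Hard tiling: indicator of the box [x_i - 0.5, x_i + 0.5) containing x."""
    return (np.abs(x - centers) < 0.5).astype(float)

def rbf_features(x):
    """Softened tiling: phi_i(x) = exp(-(x - x_i)^2 / lambda^2)."""
    return np.exp(-((x - centers) ** 2) / lam ** 2)

x = 3.7
print(box_features(x))    # exactly one tile is active
print(rbf_features(x))    # several neighbouring bumps respond, with graded strength
```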
This is a technique that goes under several names, radial basis functions being one. More interestingly, you can think of these functions as receptive fields, a name which actually comes from neuroscience, and from vision in particular: you may think of these objects as neurons that respond to signals coming from a certain spatial extent. And you can understand this as a way of describing space on a coarser scale than the actual one. So, in general, given a certain state space S, you may be interested in finding a set of functions phi_alpha : S -> R, indexed by alpha, which could be, for instance, these box functions or these Gaussian receptive fields, with alpha running over the integers 1, ..., F for some cardinality F; and you typically want a situation where F is much smaller than the number of states. These real-valued functions on the state space are what are usually called features, and it is not unlikely that you have encountered this concept before in other areas of machine learning. The basic idea underlying the notion of features is that you have some high-dimensional space, for instance R^{|S|}, and you take a linear subspace of it, the one spanned by these feature vectors. The idea is very simple: in the one-dimensional case, for example, you say, I want to approximate any possible function in terms of these boxes. What I am doing is making a crude approximation in terms of step functions; if I use Gaussians I can do somewhat better and obtain smoother objects; I could also use wavelets, or any kind of basis functions I like, in order to obtain an approximation of any function on my state space. All of these are linear approximations, because they combine different basis functions linearly in order to construct the function of interest. What I am saying here is that I can write any function f on my space as a linear combination, with certain weights, of feature functions: f(x) = sum_alpha w_alpha phi_alpha(x). Is it clear so far? These are the ideas we need in order to deal with functions defined on a large space, which, in the back of your mind, for us in reinforcement learning, is always the state space or the state-action space, and we want to find ways of describing it approximately. So what is the connection with partial observability? Well, in the case of receptive fields, for instance, you can interpret each of these Gaussians as the probability of measuring a certain position of my system given that, in fact, I am here. It is, loosely speaking, the likelihood of a certain observation in space. That is where the connection between observability and linear function approximation comes in: you can interpret these features as observations. Given a state, you extract a real number for each of a set of observables, which tells you how much weight that observable carries in your representation of the state space. That is where the conceptual link lies.
In our specific case, what we want to do is use this linear approximation specifically for value functions. What is the idea? Our value functions live in an |S|-dimensional space. For instance, suppose you have three states, so the cardinality is three: a decision process with three states and a couple of actions that can send you here and here. I am drawing arrows somewhat at random; you can have several actions from each state, and they do not really matter here. The point is that with three states, a value function in this context is a point in a three-dimensional space: if I pick a policy, you remember I can compute the value function for that policy just by linear algebra, and this will be a vector in my space, with components equal to the value from state one, the value from state two, and the value from state three. What linear approximation does is to consider only value functions that live on a lower-dimensional linear subspace, in this case, for instance, a plane: the plane spanned by two vectors phi_1 and phi_2. So I have two features, alpha can be one or two, and these are two vectors in state space, because each of them has one component per state. Let's make it an explicit example: phi_1 = (1, 1, 0) and phi_2 = (0, 0, 1). These are two linearly independent vectors; they do not span the full state space, because there are only two of them. What do they correspond to? The straightforward interpretation is that the first feature does not distinguish between states one and two: it returns one whether you are in state one or in state two, while the second feature is one only when you are in state three. In practice, the first feature is telling you: if you model your value function like this, you cannot observe whether you are in state one or two; you are aggregating state one and state two, because a value function written as a linear combination of these features will not be able to distinguish between them. There will be only two parameters: the weight w_1 tells you something about the value of states one and two lumped together, while the second feature is perfectly able to tell that you are in state three. Of course, if you have enough features and they are linearly independent, you can disentangle everything. For instance, suppose you have as many features as states, and you choose as features phi_alpha(s) = 1 if alpha = s and 0 otherwise. What are these features? They are nothing but the unit vectors along the axes, so in this case your features just reproduce the original structure of the state space: you have done nothing, it is just as it was before.
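A tiny numerical sketch of this three-state example (the value vector is made up): with phi_1 and phi_2 as columns of a matrix Phi, the best least-squares approximation V ≈ Phi w is forced to assign states one and two a common value:

```python
import numpy as np

Phi = np.array([[1.0, 0.0],    # phi_1 = (1, 1, 0): cannot tell state 1 from state 2
                [1.0, 0.0],
                [0.0, 1.0]])   # phi_2 = (0, 0, 1): fires only in state 3

V = np.array([4.0, 6.0, 9.0])  # some exact value function (example numbers)

# Least-squares projection of V onto span(phi_1, phi_2)
w, *_ = np.linalg.lstsq(Phi, V, rcond=None)
print(w)            # [5. 9.]: states 1 and 2 share the value 5 (their average), state 3 keeps 9
print(Phi @ w)      # [5. 5. 9.]: the approximated value function
```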
But in general you may be interested in lower-dimensional descriptions, or you may be forced into them: interested, if your state space is high-dimensional and you want to compress it; forced, if your observations do not allow you to disentangle different states. This kind of setup is what is called a linear approximation architecture; that is the keyword. And the natural question is: can we adapt techniques from dynamic programming to the situation where we have linear approximation? You may remember that dynamic programming is, in brief, the set of algorithms that allows you to solve the Bellman equation; for instance, we discussed value iteration at length, including in the tutorial. So a natural question is: can we modify value iteration to make it work combined with linear function approximation? Let me first remind you what value iteration was about. Let me draw an even simpler picture, with just two states, so value functions are two-dimensional vectors, and the optimal value function is a vector V* here. You remember that value iteration is the iterative algorithm which says that the successive approximations of your optimal value function are created by iterating the Bellman operator until some convergence criterion is met; let me write the iteration index in parentheses so you do not mix it up with the components. Since the Bellman operator is contracting, if you start from any guess V^(0) around here, applying the operator brings you closer and closer to the optimal value. That is the essence of value iteration. But now we want to combine this with linear function approximation, which means that our value functions are constrained to lie on a lower-dimensional subspace, like a line in this picture. So I would like to find the best approximation to the optimal value which lies within this subspace; in some sense, I want to find the closest point on the line to the optimal value, where I still have to define what "close" means. If I find it, I can derive a policy from this approximate value function, and this will be an approximate policy for my system. Is the plan clear? How would it work? The idea is actually very simple. Suppose you start from a guess which now lies somewhere on this line; remember, all of this is a graphical representation, but everything usually takes place in very high-dimensional spaces. If I apply the Bellman operator, in general it will send me somewhere else, say here, and now I am not very happy, because the point has left my subspace. So what should I do? Project it back. I project this point back onto my subspace and obtain a new guess: the intermediate point was a temporary guess obtained by applying the Bellman operator to my first guess, and projecting it back gives my second guess, and so on and so forth. So there is this back and forth between applying the operator and projecting.
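Here is a minimal sketch of this projected value iteration on a toy MDP (transition matrices, rewards and features are all made-up example numbers); the projection onto the feature subspace is done here in the ordinary least-squares (L2) sense, which, as discussed just below, is itself one of the delicate choices:

```python
import numpy as np

# Toy 3-state, 2-action MDP (example numbers)
P = np.array([[[0.9, 0.1, 0.0],    # P[a, s, s']: transition probabilities
               [0.1, 0.8, 0.1],
               [0.0, 0.1, 0.9]],
              [[0.2, 0.8, 0.0],
               [0.0, 0.2, 0.8],
               [0.0, 0.0, 1.0]]])
R = np.array([[0.0, 0.0, 1.0],     # R[a, s]: expected reward for taking a in s
              [0.0, 0.5, 1.0]])
gamma = 0.9

Phi = np.array([[1.0, 0.0],        # features: aggregate states 1 and 2, isolate state 3
                [1.0, 0.0],
                [0.0, 1.0]])
proj = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T    # L2 projection onto span(Phi)

def bellman(V):
    """Optimal Bellman operator: (T V)(s) = max_a [ R(a, s) + gamma * sum_s' P(s'|s,a) V(s') ]."""
    return np.max(R + gamma * (P @ V), axis=0)

V = np.zeros(3)
for _ in range(200):
    V = proj @ bellman(V)          # apply T, then project back onto the feature subspace

print(V)                           # approximate value function, constant over states 1 and 2
```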
So, in a nutshell, the idea of the approximation is that your new guess is what you get by applying the Bellman operator to your previous guess and then projecting it back: V^(k+1) = Pi T V^(k), where Pi is a projection operator. This is all nice and clear, and then troubles arise. First of all, we have shown that the Bellman operator is contracting in the L-infinity norm, whereas all the drawings I just made are, in a sense, in the L2 norm, so everything has to be adjusted accordingly. This means, for instance, that the point closest to V* should not be found by looking at a ball, as in the Euclidean distance: I should look at something like a square, or a hypercube, and as a matter of fact the closest point to V* on the line is then somewhere over here. That is the first thing. Second, if I use the L-infinity norm, the projection itself is a little different: here I was drawing things in the sense of Euclidean projection, taking the closest point according to the L2 distance, whereas I should do it according to the L-infinity norm. So there are all these nontrivial technical issues: when you mix up different norms, your iterated, approximated value iteration, which is what we are doing here, approximate value iteration, may behave differently. And there are other kinds of issues too, because the idea, I think, is quite intuitive, but when it comes to writing down plainly what you are actually doing, you realize it is not so obvious. Why? Because in the end, when you are happy with your approximation, you want to derive a policy, and your policy at step k will be defined as usual by the greedy choice: pi_k(s) = argmax_a sum_{s'} p(s' | s, a) [ r(s, a, s') + gamma V^(k+1)(s') ]. (Sorry, my pencil is running out of power; let me charge it for a second.) What you realize when you want to extract this policy, and also when you want to carry out the projection, is that you still have to do things in your big, high-dimensional space, and the resulting policy depends on the individual states if you extract it like that. So, first of all, these operations can be computationally very heavy. The procedure I described is theoretically sound, in the sense that it brings you as close as possible to your desired point, but it may require a lot of computation, because you still have to sum over transitions to all possible true states, and your policy depends on all these microscopic degrees of freedom. So although it is conceptually sound, it is not very useful as stated. What you would do instead is look for a policy which depends only on the coarse-grained objects: you want to make decisions on your grouped states, your superstates. This suggests modifying the approach and saying: let's evaluate this expression only for s belonging to some set of representative states, say one state per tile of my tiling. It is difficult to write this down in a way that is proper and specific enough, but let's say: only for certain specific states. Let's go back to our grid world example.
In that case, this tiling is a linear approximation of the value function that you construct, because each of these tiles is a feature, like the boxes I was drawing before, only geometrically more complex. So every time you have to compute a policy according to this rule, you should in principle compute it at every single point in here, which rather goes against the idea that everything inside one tile should be the same. So you either have to average over all the states inside the tile, which is one possibility, so you obtain many policies and derive a sort of average policy from all of them, or you pick some representative states and say: I will evaluate the policy here, and this policy will be valid for all points inside my tile. You have to use tricks like these to make this approach work effectively. But the sad news is that if you do that, the algorithm that was nicely converging to something meaningful need not converge any longer. If you really follow this idea and use only specific states, or averaged policies, you in general lose convergence. So these techniques that combine function approximation with value iteration have their limits; one has to be very careful about what one is doing, and they usually require a lot of insight about the proper choice of features, about how to make all these approximations, and so on. We will not spend a lot of time on this, because it is a relatively marginal theme in reinforcement learning: broadly speaking, very few people use it nowadays as a technique for approaching reinforcement learning problems. Nevertheless, it is quite important from the conceptual viewpoint, and for the kind of pitfalls it can produce. So that was one important remark I wanted to make. As a matter of fact, these ideas can be extended beyond linear approximation: you can replace this projection operator with something more general, an approximation operator; I will give you an example in a second. You can then think of a generalized algorithm in which you create new approximations by applying this approximation operator to the result of the Bellman operator applied to your previous guess. What is the point of generalizing beyond a projection operator? It allows you to go nonlinear. For example, rather than selecting a linear subspace over which you approximate your value functions, you choose a nonlinear approximation, for instance an artificial neural network. The idea is that now you do not use a linear approximator; you use something more powerful, like a neural network, which you can deform as a function of its parameters so as to bring it closer to your desired point. Formally speaking, this amounts to using a different approximation operator. In practice, what do you do? You choose a representation of your value function, which is a high-dimensional object, in terms of parameters, the weights and offsets of your neural network; you apply the Bellman operator; and then you minimize some loss function between the result of the Bellman operator and the class of functions you can represent. I know this is not extremely precise, but I do not want to go into the details, because they are not very relevant at this stage.
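A rough sketch of this "generalized approximation operator" idea, in the spirit of fitted value iteration: sample some states, compute Bellman targets from the known model, and fit a nonlinear regressor to them. Everything here is an illustrative assumption: the toy one-dimensional MDP, the use of scikit-learn's MLPRegressor as the function class, and all the numbers.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Toy deterministic MDP on [0, 1]: actions move the state by +/- 0.1,
# and a reward of 1 is received when the state reaches the right end.
gamma = 0.9
states = np.linspace(0.0, 1.0, 51)

def step(x, a):                       # known model: next state and reward
    nx = np.clip(x + (0.1 if a == 1 else -0.1), 0.0, 1.0)
    return nx, float(nx >= 0.999)

def bellman_targets(V_predict):
    """(T V)(x) = max_a [ r(x, a) + gamma * V(x') ], evaluated on the sampled states."""
    targets = []
    for x in states:
        targets.append(max(r + gamma * V_predict(nx)
                           for nx, r in (step(x, a) for a in (0, 1))))
    return np.array(targets)

# Nonlinear "approximation operator": fit a small neural network to the Bellman targets.
V_predict = lambda x: 0.0             # initial guess V_0 = 0
for _ in range(10):
    y = bellman_targets(V_predict)
    net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000).fit(states.reshape(-1, 1), y)
    V_predict = lambda x, net=net: float(net.predict(np.array([[x]]))[0])

print(V_predict(0.0), V_predict(1.0))  # values grow toward the rewarding end
```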
This is just to tell you that it is possible to go beyond linear approximation and combine nonlinear approximation with dynamic programming. This was a subject which was very hot in the nineties, with books written about it, and some relevant algorithms, for instance for playing backgammon, were developed in those early days using these kinds of techniques; you can find that story in Sutton and Barto's book. But as I said, it is something that did not really age well, in the sense that these ideas of combining nonlinear function approximators, like artificial neural networks, actually displayed their full power when combined with model-free methods rather than with MDPs and model-based methods, and this is something we will discuss later on. So this was just to tell you that the combination of neural networks and reinforcement learning has a much older history than one might think on the basis of the recent impressive results, and that it is already present within dynamic programming. Okay, so this was more about the mathematical things you can do. But the most important take-home message about function approximation that I really want you to keep in mind is that function approximation by itself is inherently limited. I will give you one example where it performs horribly, and we will see what is lacking in this situation, what we need to incorporate that is absent from this scheme. The situation is the following. Let's go back to our cart pole and imagine that we have a real cart pole system and a camera taking images of it, with the images spaced one second apart: I take a snapshot of my system, then after a second another snapshot, and so on. And I have a powerful deep convolutional network which extracts any kind of feature from my images, and I ask this convolutional network to produce the relevant degrees of freedom to control the system. What I can expect is that this device can compute very effectively the location of the cart from the images, and the angle of the pole. So the question is: given this projection, if you wish, of my state space onto these coordinates, can I control the system? Well, think about it: if you are in this situation and your pole is falling down, you want to push the cart in this direction, along this arrow, to compensate; but if the pole is swinging up, you do not want to do that. What is missing in this idea of taking a snapshot of my system is that I know nothing about x-dot or theta-dot, and if my images are spaced one second apart, I may not be able to extract this information at all. The big issue, at the formal level, is that function approximation works like this: you have the state of your system; from your state at time t you extract some observation y_t; from this observation you choose your action a_t; these two send you to a new state s_{t+1}; and then you repeat. So if you just use function approximation to describe your system, you are only looking for reactive strategies; a strategy of this kind is called a reactive strategy. It only cares about the observation you can make at that time. If the observations are rich enough, a reactive strategy can work well.
In particular, if your observation y_t is exactly the state of the system, then everything is fine, no problem. But if your observations are severely limited, then reactive strategies are all you can have, and again, in our example, if I observe only the position and the angle, I am severely limited in the kind of decisions I can make. So how do I get around this? Well, for instance, I say: I need to take a movie with a time resolution of 0.1 seconds, and if I have positions and angles every 0.1 seconds, I can derive velocities from them. But what does this process of using two different time frames imply? It implies that I have to have some memory. My strategies can no longer be reactive: I have to keep in mind at least the previous values of positions and angles. That is quite straightforward, nothing very complicated, but conceptually very important, because we have to move from this setting to a new one in which states give out observations, these observations are stored in a memory, and from the memory we pick an action. At the next step, state and action produce a new state (let me correct my graph, this arrow already went there), the new state emits a new observation, which is fed into the new memory together with the action, and this gives rise to the new memory, and so on. What has changed is that the presence of a memory allows me to keep a record of information: I can use the memory of past observations to reconstruct what the current state of the system is, even if I do not observe it in full. I can derive velocities from positions by using memory. And this, beyond this very simple example, is very important. The goal of our next lecture is to show how to incorporate these sequential observations into a mechanism which, for those of you who have already encountered it, is Bayesian updating, and to use this inference procedure to construct effective decision making in the presence of uncertainty. This is something that function approximation alone cannot do in general. So I think this is a good point to stop. It was a lecture with not much math but lots of ideas, and I hope that all of them reached you somehow. If there are any questions, I am happy to take them; otherwise, I will stop sharing and stop recording.
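As a closing sketch of this memory idea (the frame rate, the state layout and the toy decision rule are all assumptions, not prescribed in the lecture): keep the previous snapshot of (x, theta) in memory and reconstruct the velocities by a finite difference, so that the policy can act on the full state even though each observation alone lacks x-dot and theta-dot.

```python
DT = 0.1   # assumed time between camera frames (seconds)

class MemoryPolicy:
    """Policy with a one-frame memory: observations are (x, theta) only;
    velocities are reconstructed as finite differences between frames."""

    def __init__(self):
        self.prev = None                      # memory: previous observation

    def act(self, obs):
        x, theta = obs
        if self.prev is None:                 # first frame: no velocity estimate yet
            x_dot, theta_dot = 0.0, 0.0
        else:
            px, ptheta = self.prev
            x_dot = (x - px) / DT
            theta_dot = (theta - ptheta) / DT
        self.prev = obs                       # update the memory
        # Toy decision rule: push in the direction the pole is leaning or falling.
        return +1 if (theta + 0.5 * theta_dot) > 0 else -1

policy = MemoryPolicy()
print(policy.act((0.00, 0.02)))   # first frame: decides from the angle alone
print(policy.act((0.01, 0.05)))   # second frame: also uses the estimated theta_dot
```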