Please come five minutes in advance. Thank you very much. So today we're going to start with the real thing: first, setting up the nomenclature, and what we mean when we talk about certain objects and notions that are widespread in artificial intelligence, and in reinforcement learning in particular. You can interrupt me at any time to ask questions. Just please use the microphone so the recording will include your question; otherwise I have to repeat it. This is for the sake of people who are not able to attend this set of lectures in person and might be interested in watching them on the web, on our channels. Before starting, I just wanted to ask a general question: who would be interested in a one-hour or one-and-a-half-hour general discussion about the interactions between artificial intelligence and ethics, legal issues, the future of work, and all these things? Please raise your hands. OK, good enough. We can have this; we'll decide together when there is a good time slot over the week, so as not to put too much pressure on your time to study. So, first things first. From the examples that we discussed yesterday, some distinctive features emerge. One general question that we cannot escape is: what is intelligence? We want to develop algorithms for artificial intelligence, for machines that do things, see things, perceive, et cetera, so we have to distill a notion of intelligence. And it's very difficult: there is a very wide spectrum of definitions that you can adopt. We're not going to settle on a single definition here. The history of artificial intelligence actually started between the 50s and the 60s; there is a beautiful book on that history which I suggest you all read (I will provide references for background material later). At the beginnings of this history, one notion of intelligence that was adopted was the ability to perform logical reasoning: combining logical connectives and constructing, out of logic, the ability to reason and do intelligent things. That was a dead-end road. If you wanted to draw a similarity, it would be like trying to understand the way this chalk splits starting from quantum mechanics: you have to adopt a different level of description, not just a logical combination of elementary pieces, because otherwise you don't get anywhere. You might remember from the history of mathematics when Bertrand Russell wrote books as thick as this, trying to derive the axioms of arithmetic with integer numbers starting from logic. It's a long path. That doesn't mean it's not a worthwhile problem in mathematics and foundations; it's just something that doesn't work in practice. You have to adopt a different level of description if you want to implement intelligence. So I'm asking you: would anyone like to contribute their own understanding of what intelligence is? What makes intelligence distinctive? The ability to distinguish different scales and to order them. Different scales, to order them, and to do some hierarchical inference. OK. You've been very much biased by recent experiences. I'm looking for definitions that would work for machines, animals, and humans; this is a definition about scales.
Then do we have anything else? Maybe the ability to gauge one's own behavior based on the experience of the surrounding environment, and the ability to make predictions based on that. OK, good. So there are two different things here. One is evaluating your own behavior: getting some feedback from the environment which says you're doing right, you're doing wrong, to some degree. It might be a very complete feedback telling you, yes, you've been doing exactly the right thing, except at that particular time you should have done this, et cetera; or it might be a very loose feedback. And then the other thing is perception, you said, right? OK. Yes, just a second. It's quite similar to what was already said: given also that there are several kinds of intelligence, it is the ability to perceive the environment, process it at some level, and then be able to use it to achieve a goal or to make predictions that are useful to adapt and adjust one's behavior, for example in logical intelligence or artistic intelligence. Yes, another key word that comes up is adaptation: the ability to react to changing conditions in the environment, which is also crucial. Another contribution here? Along the same lines: finding patterns in order to make predictions about the environment. OK, fine. This would mean, I think, constructing a representation of the environment which is useful, perhaps according to some predefined classes or not; that depends. Last one, then we move on to summarize. I think intelligence is: I give you the axioms of number theory, or sometimes the axioms of algebraic geometry, and then out of them you can prove Fermat's last theorem. If you cannot do that, then it's just artificial. OK, I don't quite agree with this last definition. This is more the deductive kind of intelligence: you start with axioms and then you are able to deduce the consequences. This is pretty much the approach that, as I was saying, didn't really lead to any substantial improvement in practice. But it is one definition that has been adopted in the past. Now let's try to distill all these ideas. What the community of people working in artificial intelligence agrees upon is the following picture. You've been saying many of these things; we're just putting them in order. First: in order for there to be intelligence, there has to be an agent and an environment, which are two separate things. The agent is the one that perceives, acts, thinks, plans; the environment is whatever is outside this agent, which might include other agents. This is a very general definition, but it is key. So the basic idea is that there is an agent (if you can't read from the back, tell me and I will increase my font), and this agent lives in an environment. For the purpose of drawing, I draw it as another box, but of course the environment is everything that is around the agent. Then, what kinds of exchanges are there between agent and environment? One is perception: the environment sends signals to the agent, and these signals are called percepts. We will see later what kind of things they are; we will go into more detail. The agent receives the percepts through a set of sensors.
So there is an interface between the agent and the environment, which I drew here, made of sensors: anything that is able to perceive the environment. In AlphaGo, what is it? There is probably a camera on top that captures the current configuration and passes it on; it is able to segment the image. Or even simpler: there might be sensors on the board which tell where each piece is. That is one simple interface. For robots there are plenty of interfaces: visual, tactile, mechanical; some robots have artificial noses to sense chemicals in the environment. Anything that is akin to our senses: for humans and animals, the senses are what conveys information from the environment to the agent. And inside the agent there is this box, which might be the brain, or the algorithm, or the hardware that does the computation. After processing this information, the outcome is some action: the agent does something. In general, the result of this action changes the state of the agent with respect to the environment. It may not always be the case, but if we want to encompass the most general case, actions modify the environment. For instance, in a task in which I am the agent and I have to get to the other side of the room, what do I do? I look around, I see there is a step here, the desk there; I take a step. Sensory, motor: I make the connection. My state with respect to the environment has changed, because before I was, say, one meter away from the corner of the desk, and after taking a step I am just 10 centimeters away. My relative position to the environment has changed. This is the most general way of thinking about these things. And then this process goes on and on and on: it's a repeated interaction between the agent and the environment. One more piece of terminology: the layer which transforms concepts, ideas in the brain or in the algorithm, into real actions that produce motion, for example, is called the actuator. How these things are built, the sensors and the actuators, is a matter of engineering for a robot or for a machine, so we are not going to focus on that. They will be abstract processing units that transform the input into something managed by the algorithm; and the algorithm, in turn, outputs some instruction for the actuators, which eventually turns into some change. In my mind, the agent might say, I want to go south; but the actuators, which are my legs, my whole body, might implement this in a way that is not exactly south. So there is also this difference between the instruction coming from the agent and how it is actually implemented in practice, which may vary. This is just to distinguish the logical layers of the process. Now, there are different kinds of learning tasks which can be considered, and you might have heard about them. It is important to distinguish from the outset the three great categories; their boundaries are a bit blurred, but they are very helpful for organizing things. You might have heard about unsupervised learning, supervised learning, and reinforcement learning, which will be the subject of our discussion. So what's the difference between the three? Does anyone know what unsupervised learning is about? Yes, please speak up. Sorry, I have to give you the microphone.
Basically, in supervised learning, you first... Unsupervised? Unsupervised, OK. In unsupervised learning you just let the machine learn by itself, basically. You give the rules to the machine, and you define a function to optimize. Actually, I'm not sure exactly what unsupervised learning does. Would you like to try? I was more sure about supervised learning. Go for supervised, then. In supervised learning, you first provide some examples to the machine; you see what the outcome of the machine is; and based on the outcome, thinking for example of artificial neural networks, you change the weights between neurons until the machine is actually capable of returning the outcome that one wants. OK, that was a bit involved, but let's try to distill it. There are these two big categories. Unsupervised learning is what a machine can do if it is offered examples, data, and this data comes with no particular information attached. The machine has to discover whether there is some structure in the data: some correlation, for instance; whether in this amount of data, which might in principle be very large, there is actually some low-dimensional description that works. This works by fitting examples without any particular information from outside. The other extreme, supervised learning, is when every example comes with a label which says: this is an image of a dog, this is an image of a cat. You give many, many examples of that, and then by some technique, including the ones your colleague was talking about, the machine learns how to classify things, perhaps by reducing some error function. Both kinds of learning techniques can be incorporated in this description. Let's start with unsupervised learning. Unsupervised learning means that you are giving the machine images of dogs of all kinds, without any label attached, and the machine has to learn that there is some commonality between all these things. Of course it's a long and painful process, and you need a lot of data for it to converge; unsupervised learning is a very difficult task in general. Unless the data are very structured: at first sight you don't know, because they are very complicated to look at, but they have some very precise structure in them, and then you will be able to extract the correlations in the data that you have. Basically, you can think of unsupervised learning as a one-way process. Data come in, the agent sees them and processes them, and then the process stops. It is a one-directional flow: the data come in, and the mind, the brain, the machine mulls them over and extracts some information. There is no returning to the environment, no changing what the machine does according to the data it gets. This kind of adaptive behavior is typically not a feature of unsupervised learning. Yes? AlphaGo Zero was totally the opposite; there was no supervision at all. Totally unsupervised? Sorry, what do you mean by that? It was clearly not supervised, but it was not unsupervised in this sense either. It was reinforcement learning, which is the third category I'm going to talk about; it sits in between.
So the supervised case is when, from the environment, you get data plus a lot of additional information about the data: tags, labels. There is really a teacher sending information. The agent sees the data and classifies it, says, OK, this is one or the other; and the teacher also tells it, OK, this was the right classification, that was the wrong one. And then the agent can improve on that. But the data are just there; they are what they are. You might think of incorporating a second stage: once you've trained the algorithm, you go over a new set of data and see whether it is able to classify the new data correctly, which would make perhaps one more loop. But then the process is stopped. So the first thing is that these classical ways of learning do not require a continuous interaction with the environment; they are not made with that intent. They are basically one-shot or two-shot processes. That's the first point. But there is another, more important thing missing in these approaches. What was the key distinctive feature of all the movies that you saw yesterday? Pardon? Dynamics? Yes, right, that's correct. Adaptation? Yes. But rewards are the manifestation of what? Pardon? Action on the environment? Yes, that's correct. These machines want to do something; they have a purpose, a goal. That's the most important thing: there is an explicit goal. AlphaGo wants to win. The robot wants to get to some place through the snow, open the door, and for God's sake, it really does want to open that door, much as the rat wants to get the pellet or avoid the electric shock. So there is an explicit notion of goal, of purpose, of objective, which in the other two forms of learning is not evident at all. For argument's sake you might say, OK, this algorithm wants to reduce the error, so the error is a goal. But why? What for? Once you've classified dogs and cats, what the hell are you going to do with that? What is the purpose of that? So reinforcement learning is one of the paradigms of learning, let's say; again, the boundaries are blurred, so don't take everything I say as putting each thing in a single box. But it is the paradigm which focuses on the fact that agents have goals. These goals are somehow connected with rewards, which don't appear yet at this stage but will soon, and the agent's actions are oriented toward getting as much as possible of this objective. And they do so by repeatedly interacting with the environment. So these are the two aspects that make reinforcement learning distinctive: there is a goal, the reaching of which is mediated by rewards, a specific way of interacting with the environment; and there is a dynamical process going on, a repeated interaction with the environment. These things you have to keep in mind. It is a very powerful framework, because it allows you to describe behaviors from simple organisms such as bacteria, where you can try to formulate questions in this framework, up to mice and primates and machines. It really is a very wide and general notion, and that's why we are here discussing it. Now, what makes reinforcement learning peculiar in general is that the percept can actually be split into two components. The percept will be a pair, made of rewards, or reinforcement signals; the two terms stand for the same thing.
There are, of course, differences in concept. And observations. So this distinction... bigger? I'll try to write bigger. Is that slightly better? Thank you. Both of these things are signals coming from the environment. We just want to split them artificially because they play very different roles. Think again about the rat in the cage. It was making various observations: the environment, the presence of the bar, eventually the presence of the pellet. These are all observations. And the reward comes from actually eating. So we are going to split these two things. Both of them are part of the percept, because the agent perceives the reward and says, ah, that's good. But why do we want to single out the rewards? Because out of rewards we are going to build our objective function. This is what is called in the literature a hedonistic agent: the goal of the agent is to maximize its pleasure, and its pleasure is measured in terms of rewards. One thing is very important and has to be clarified from the beginning: the goal of an agent is not necessarily, not always, to maximize the immediate reward in itself. What does that mean? Let's go back to an example. What are we doing here today? What is the reward that you are getting for sitting here, the immediate reward? Knowledge. Knowledge? Is it an immediate reward? Signing the attendance sheet. That's better; the closest thing to an immediate reward you get is that I sign the attendance, I mark this thing. I'm avoiding a punishment, if you wish. This is the immediate thing. But are you here because of that? Might be; it's a legitimate behavior. But perhaps some of you have some sort of longer perspective. Let's say: I'm here because I have to sign the paper, but also because in the longer run this will be useful for me. That's the hope at least. (Why I am here is an entirely different question; I don't even have to sign.) But this is the serious point: rewards are short-term feedback. You take an action, you get a reward; you take an action, you get a reward. But that's not the goal, not necessarily. It is the goal only if your time horizon is very short. Now, I'm speaking very loosely, because all the things I'm talking about will become symbols and formulas; but it's better to get an intuitive understanding first. So the time horizon is very important. If you knew that today at 1 PM you would all be awarded PhD degrees from Harvard for free, this would set your horizon to a very short time span: do I have to struggle for many days for something that will come anyway? No, it comes just after the lecture. So it matters what the horizon is: are you doing things for the long term or for the short term? Another more mundane example: suppose you have in your pocket a certain sum of money, say $1,000. There are many things you can do. You can go and say, OK, I'm going to rent a boat and have a party, everybody's invited. Or you say, OK, I'm going to put it away, invest it into something that will perhaps give me, I don't know, 3% after the 10 years that I keep the money. These are two very different extremes, and they depend on your perception of what your time horizon is. You're making a lot of assumptions about what you will be in 10 years if you put that money into a savings account.
And you're also making a lot of assumptions if you just dissipate it in the course of a few hours. Yes, please. Yes, rewards in general. For simplicity, we tend to conflate many things into this notion of reward. These might be as varied as the pellet for the rat (you don't get any pellet here), more symbolic rewards, or even reinforcement signals at a level we don't even consciously perceive; these rewards could be dopamine signals in the brain. It's a very abstract notion which tries to encompass all the variety of these things, and sometimes, of course, it's not rich enough to capture all of them. But we have to start from the simplest description. That's a good point. Now, here is another very important thing. Once you accept the idea that you're not taking all your actions just for the immediate reward, and you have some time horizon ahead of you, which almost all living beings do except in particular conditions, then another requirement comes immediately. Why would you decide to put your money into a savings account? First of all, like we said, you have a time horizon; you have some expectation of how long you will live. That's the first, basic thing; it's crude, but that's it. The second thing is that you think there will be things in the future you are putting money away for: you eventually want to buy a car or a house, or you may be putting it aside to travel somewhere. You have plans. So as long as there is a long-term goal to be achieved, the notion of planning immediately comes into the game. A key idea of all these learning processes is that you have to predict the future somehow. There is this implicit task of trying to optimize your actions in order to reach some long-term goal, which will require some sort of planning. How good your planning abilities can be depends on many factors, and we will discuss all of them in great detail. But that's also important: there is this idea of projecting into the future in order to forecast what will happen. All the most efficient algorithms have very good ways of predicting what will happen. In some situations, your knowledge is so limited that you cannot really predict, and then you will just have to proceed in a slower way in order to learn. But you can still get to very good and efficient behavior nonetheless. Very good. Questions? No. So let's start and put these things on firmer ground. The language that we will be using is deliberately simplified, for the purpose of highlighting the conceptual points rather than the technical aspects. For instance, we will be dealing with a description in which time goes on by ticks: we will discuss all our models, processes, and ideas with time advancing by integer steps. This is a simplification; of course time flows continuously for all practical purposes, for our perception and for machine perception. But it makes the description simpler from the mathematical viewpoint, so we will stick to it. Just so you know in advance: it is possible to go to the continuous-time limit, with many other interesting issues that come about, but we will have neither the time nor the techniques to do it. Yes, that depends on how short your time to react is. I mean, if you're not disturbed by this, that's fine. That was exactly the purpose: if nobody is disturbed by the fact that we go on in discrete time, we will do that.
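For those of you who think in code, this discrete-time interaction loop can be sketched as follows. This is only an illustrative skeleton under the assumptions just stated; the names Environment, Agent, step and act are made-up placeholders, not any particular library, and the dynamics and the decision rule are deliberately left empty.

```python
# Sketch of the discrete-time agent-environment loop (placeholder names, no real dynamics).

class Environment:
    def step(self, action):
        """Apply the agent's action and return the next percept: (reward, observation)."""
        reward, observation = 0.0, None   # placeholder: a real environment would change state here
        return reward, observation

class Agent:
    def act(self, observation):
        """Map the latest percept to an action (sensors -> processing -> actuators)."""
        return 0                          # placeholder decision

env, agent = Environment(), Agent()
observation = None
for t in range(100):                        # time ticks by integer steps t = 0, 1, 2, ...
    action = agent.act(observation)         # the agent acts on the environment
    reward, observation = env.step(action)  # the environment responds with a percept
```

Everything we do in the next lectures is, in a sense, about what to put inside act (the policy) and how to exploit what step returns (the percepts).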
So there is an index of time which advances by steps: time will be 0, 1, et cetera. What are these times? They are the times at which an interaction with the environment happens and a subsequent action is taken. It is the clock over which these things happen or might happen; the system might stay put for a long while before taking any action, because taking no action is also an action. So there is a clock which is ticking. And at every time, what is happening in our system? Like I said, there is the percept, which we might denote by a small e_t. This is the percept: the signal that comes from the environment to the agent at time t. And it usually is a pair, r_t and y_t. The first is the reward, or reinforcement signal, and the second is the observation. In general, these quantities might be discrete or real-valued. The reward, in our setting, is a one-dimensional signal: just a real number. So we do not encompass in our treatment, and basically 99.9% of the literature does not consider this, any vectorial structure in the reward. One consequence is that punishments are viewed as negative rewards, like we discussed yesterday. There are no nuances in the reward: you have to put everything onto a single line. So a pellet which is slightly larger than another has to be compared, in the same units, with a pellet which is smaller but tastes better. All these details are conflated into a single measure. Yes, it will come, just a second. This is the immediate feedback from the environment at every time step. The observation, again — larger? OK, I have to make an effort; sorry, this is very small, I agree. Is that the issue? No, it's a simplifying assumption, like I said, which is customarily made. It clearly does not cover all situations, but if we want to understand the basic concepts of how reinforcement learning works, there is already enough richness in this scalar case to cover basically all the concepts we were discussing. So it's an assumption, a limitation, but it's not very restrictive conceptually. No, no, no: the reward is given in its own units, which you can normalize to some reference value, but it has its own units, which might be the sugar content of the pellet or whatever. We won't focus much on these aspects; we will treat it as abstract and won't care much about dimensions. But it could be anything. The observation, instead, is a vectorial object, which might be very large. For a robot, the image that the camera is capturing at any given step, plus other information, even from radars, might be a very high-dimensional vector; and we are fine with that. So there is much, much more scope here for describing the outer world: it's not just a single number. It might be very simple in some specific examples that we will discuss for the sake of teaching ourselves how this works, but in general it is a very large object. And then the action is also labeled by a symbol, a_t. Again, this could be a continuous action or a discrete action. For board games, it is taken from a set of actions available to you, which is a finite set.
So from every configuration of the world, there is a finite set of allowed actions that you can take. But for some other tasks, for instance the movement of a robot, the action might be: turn by 27 degrees, or 27.5, or 27.6666; there might be a continuum of actions. For simplicity we will consider situations where actions are discrete, but again, the framework allows for extensions to continuous action spaces. And then there is another quantity, which is the state of the environment, typically written s_t. This, again, is a huge-dimensional object. For Go, for instance, it's the configuration of the stones on the board. For the robot, it is whatever is outside, at least within the room. For the mouse, it is the cage and whatever it contains. So this is really the real thing: what you would observe with perfect knowledge of your environment. It is important to note that the observation is not, in general, the state of the environment, even though sometimes it can be, in simplified examples. In general, the observation is just a subset of the things that could be observed about the environment. For physicists, this should be a rather obvious notion: you don't measure everything in the world in order to study a process; you select some particular observables. So the observation is the set of observables, and the state is the system; and typically the number of observables you focus on is much smaller than the number of degrees of freedom of your system. So s_t is the full state of the environment. Are these definitions clear? Any questions about them? Then we move on and describe the first general setting, which is really very wide and encompasses all of the further distinctions that we will make; it goes under the name of the full reinforcement learning problem. We need a few additional definitions, and these will formalize the notion of an agent experiencing interaction with the environment, collecting information about it, taking actions, and so on and so forth. The first thing we have to introduce is the history. What is the history? Typically it is labeled as H with a subscript "less than t". This means that it covers everything that happened before the time t at which the decision will be made. And it is just the sequence of all the previous things that happened to the agent and that the agent is aware of. For instance, the first action that the agent took. (I'm starting with the action; I could have started with the percept; it doesn't matter.) So I'm starting here: at time t equals 1, the agent takes some action. As a result of this action, the environment responds: the agent did that — for instance, the rat pressed the bar — and the environment sends information back to the rat, telling it: you get a reward, this will be a plus 1 reward, for instance; and you observe what you've been doing, you see the pellet fall into the box. That's the observation. And the agent records this: it has gotten a reward and made an observation. And then the cycle starts again: there will be a new action, with a new reward and a new observation as feedback from the environment, and so on and so forth, up to time t minus 1: action t minus 1, reward t minus 1, observation t minus 1. So that is the previous history.
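Just to have it written down in one place, this is the notation just described; the exact ordering and indexing convention is a choice, and here I follow the one on the board (action first, then reward and observation):

\[
e_\tau = (r_\tau, y_\tau), \qquad
H_{<t} = \big(a_1, r_1, y_1,\; a_2, r_2, y_2,\; \dots,\; a_{t-1}, r_{t-1}, y_{t-1}\big).
\]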
Now, what the agent does is this: given this stream of events that occurred, all these observations and actions that it took and rewards that it got, it has to decide what to do next. And how it decides what to do next is what goes under the name of a policy. What is a policy? A policy is how actions are chosen, and in general a policy is a probability distribution over all possible actions. Suppose your agent has 10 possible actions: there will be a probability distribution over these actions, with some of them deemed more probable and others less probable. Because typically we act like this: we have to decide between different things, and if the decision is difficult, we don't just say, OK, I'll do that, this is going to be better; there is a chance that other actions would be equally good or better, et cetera. So this allows for the possibility of describing policies, or decision-making, with a certain random component. This policy is a probability distribution: it is the probability, say at time t (this is a somewhat redundant notation), of taking one action a given the history. This is where the agent makes its decisions: it has a certain history, it looks at this string of numbers, and then, according to this function, picks one action out of the many it can take. Here a is one generic action among all the possible actions, and the agent picks it according to a probability distribution, which is pi_t. This also includes deterministic strategies: if pi_t is one for one specific action and zero for all the others, then it is deterministic; otherwise it is random — the machine throws a random number and, according to the probabilities given by pi_t, picks an action. Clear? So this is the policy. As a result, the new action a_t will be drawn from this probability distribution pi_t, conditioned on the previous history. That's what I mean; I'm just rephrasing the statement: I will pick the action a_t at time t according to that probability distribution. So this is a description of how the agent perceives the environment and collects information about it. In this case, which is a particular case, there is no processing of the history: in principle the agent keeps an infinitely long record of memory behind it. Of course, it is always possible to extend this to the case where there is some forgetfulness, where parts of the history are erased; we will discuss this in due time. But for now, let's assume the agent has no memory limits and can store all its previous experience in this string and then pick a decision. The question is, again: why does it want to do this? What is the goal? Here reinforcement learning theory proposes one specific definition; well, there are many of them, but this is the most common one. What is the agent's goal? The goal is to maximize the following object. I'm writing it down and then I'll explain what it is. This symbol stands for the expected value, because all these quantities are random in general: rewards might be random, observations are random variables, so everything here is stochastic in general. So this is the expectation value of a sum over, say, tau going from 1 to infinity — let me pick the index right for once — no, let's say tau going from 1 to infinity, or 0 to infinity. So this is the objective function. There might be other definitions; this is one possible choice. What is this quantity inside?
This we know what it is: it's the reward that you get in the future. You are sitting here at time t; all these rewards arrive later, in the future, and you want to maximize this weighted sum of future rewards, up to very far into the future. Yes, these are the weights: when I say weighted sum, I mean that these are the future rewards and these are the weights. What are they? I'm about to tell you. This gamma factor is called the discount factor, and it is one way of implementing the notion of time horizon that we were discussing earlier. Gamma is a number which typically lies between 0 and 1; whether 1 itself can be included depends on particular properties of the process — it might have to be strictly less than 1, or it can be equal to 1, depending on the kind of task, and we will see examples. So this discount factor implements the notion of time horizon in one simple way, which is the one we adopt because it is simple. To understand how it works, think about the simplest case, gamma equal to 0. If gamma equals 0, all the terms of the sum in the future vanish except the first one (sorry, the weight is gamma to the tau), the one with tau equal to 0. So when gamma is 0, only the next reward matters. And this is the behavior of your fellow... no, sorry, I'm just kidding — the behavior of an agent which only cares about optimizing the immediate reward. It is also called greedy behavior: you look just one step ahead into the future and ask, is there a reward I can grab right now? I'll take it, without caring about the consequences. It's a very myopic, short-sighted behavior. So as gamma tends to 0, you go toward myopic behavior: you don't look very far into the future, you don't care. The other way around, when gamma tends to 1, the sum takes a very long time before being cut off by these exponential factors, so you are looking very far into the future, and events that happen after 100 steps are almost as important as events one step ahead. In that case, when gamma tends to 1, you are required to act wisely and say: OK, if I skip the lecture I get an immediate reward, but in the end will I suffer more? Yes, in two weeks' time, when the exam comes. So this is just one way of implementing the horizon, and it is theoretically appealing because you can interpret it in a very simple way: you can relate this gamma to the probability that the agent dies after each step — we'll make this precise in a moment. Yes, you can extend this in all possible directions: the discount factors might be any function of time that you wish, and people have explored this; in the psychology literature you can see that this is actually not a very good description of how human decisions are taken. But this form is appealing for two reasons. The general consideration is that with this description we will be able to understand a lot of things, and it is actually what is done in most algorithms; that's the basic motivation. Then there is a motivation which is more historical: these discount factors come from economics. If what we were deciding about were money, and the environment were the stock market, then this is a legitimate description of what happens in financial markets, and this gamma would essentially be set by the rate at which your money increases in value if you invest it rather than keep it in your pocket.
If you just keep it in your pocket, it effectively loses that amount relative to investing it. So this discounting factor comes from economic considerations, to give you an idea. There is another interpretation, which is appealing for physicists, and it's the one I was alluding to before: you can think of gamma in terms of the probability of dying at every step. In a process where you take a step, with probability 1 minus gamma you die, and with probability gamma you survive. So gamma is a survival probability, and after tau steps the probability of survival is gamma to the tau. This describes an agent which at every time step could simply be pulled out of the game by fate. And that's why you can think of this as a horizon, an effective horizon: if you expect to live 100 years, then you will set gamma so that 1 over (1 minus gamma) is 100 years in your time steps. This should be intuitive: it's just like a radioactive decay, a geometric process over discrete time steps. Like I said, punishment here is just negative reward: when rewards have a negative value they are interpreted as punishments, and when they are positive they are interpreted as encouragements. And, like I said, this low dimensionality of the reward is — what do I want to say — yes, it is problematic in the sense that it doesn't reflect all the complexity of the experiences we actually have. But it is also very interesting, because the algorithms we will describe will be able to act very well, and even optimally in some specific cases, on the basis of a very partial feedback. At every step the observations carry information about the state of the system, which is of course helpful, and as an agent you will want to rely on that. But you could even have an agent which makes no observations at all, and the only thing it perceives is rewards. And rewards are, of course, a very unfaithful description of the environment, because they cannot capture everything. Suppose your mouse is totally blind and cannot sense anything; the only thing it can feel is that a certain reward follows a certain action. It will be able to learn nonetheless, even on the basis of a very unstructured signal like the reward alone. That is also why it is important to keep this reward signal very low-dimensional: it shows you that, in order to improve the behavior of your algorithm, feedback from rewards alone will essentially be sufficient. Before moving on: gamma is given. In this process, some things will be given and some others not, but gamma is given; it is not something that has to be learned. What is given and what has to be learned will become clear in a second when we move on. At this stage, the crucial thing is that what has to be learned is the policy: the choice of actions that maximizes this objective. This maximization is a maximization over policies; that's the key point. We want to find decision-making rules that map histories into actions in such a way that this long-term return — this quantity is called the return, which is different from the reward, like we said, unless gamma is equal to 0, in which case they coincide — is maximized. Is it clear what the goal is? I assume it is.
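To fix the notation, here is the objective just described, written out as a formula, calling the return G_t for short. This is a sketch of one standard convention; whether the sum starts at tau = 0 or tau = 1, and exactly how the future rewards are indexed, are just conventions, as noted above:

\[
\text{return:}\quad G_t \;=\; \sum_{\tau=0}^{\infty} \gamma^{\tau}\, r_{t+\tau},
\qquad 0 \le \gamma \le 1,
\]
\[
\text{goal:}\quad \max_{\pi}\; \mathbb{E}_{\pi}\!\left[\, G_t \,\right]
\;=\;
\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{\tau=0}^{\infty} \gamma^{\tau}\, r_{t+\tau}\right].
\]

For gamma = 0 only the immediate reward survives (the greedy, myopic case); reading gamma as a per-step survival probability, the effective horizon is of order \(1/(1-\gamma)\), so gamma = 0.99 corresponds to roughly 100 steps.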
So as physicists, all these things should sound familiar to you, to some extent. Does this ring a bell for any of you, about anything you might have encountered during your studies? Sorry, I couldn't hear; please speak up. OK, yes, that's related, but I was thinking about something more basic that you might have encountered. OK, yes, that's a variational principle, which shares the idea of maximizing something. A cell? That comes from biology, so it certainly fits, but I'm looking for something from physics, and something much simpler; you're looking for difficult things. Yes, yes, yes: the thermostat in the room. The thermostat, which is keeping the temperature, I hope, at a certain fixed level, is a device like this. It's a problem of feedback control, something which has existed in engineering for, I don't know, a hundred years. And in physics it's the same: you have a dynamical system, and you have some parameters that you can adjust in order to make your system do what you want. That's how machines work, right? You have to find the right way to control your system in order to make it do something. So this is one instance of a feedback controller; ours is of course more general and more interesting, but the roots of all the things we're discussing are in the realm of optimal control theory — in particular, stochastic optimal control theory. And then you might use different languages depending on whether your system lives in continuous space, as it typically does, or in a discrete set of states and actions, et cetera; but if you abstract from this, it is exactly the same problem. Very good. Now we move to another somewhat descriptive part, so I will erase this part of the blackboard; perhaps I will erase that part. This is another important classification that we have to make at the beginning, and it will help guide us through the different things that can be done. The full reinforcement learning problem is actually extremely complex. Suffice it to say that, to date, there is no mathematical result available about the properties of any specific heuristic policy that you might think of implementing in this setting. This is really terra incognita for mathematicians, this full reinforcement learning problem. But there are versions of it, simplified in various respects, for which we have, on the contrary, many results — results that are very interesting and tell us a lot about the limits of these learning processes and about the benefits of being able to predict or to plan, et cetera. So in order to navigate this complicated space, I find it helpful to try to locate on an abstract two-dimensional plane all the versions of reinforcement learning that have been around. This is not exhaustive, like any two-dimensional map of a very complicated multi-dimensional problem, but it is something that I think is helpful in organizing your thoughts. It will also give us a sort of path through the lectures: we will start from one point and then move to other points, et cetera. Like we said, a key step in being able to optimize your performance, optimize your return, and make good decisions, is the ability to somehow look into the future. Now, again as physicists: there are two key ingredients that determine your ability to predict, and I want them to come from you. Two ingredients.
You have to know two things to be able to predict the future. You are physicists, so don't think of complicated stuff. No, no, one at a time, not everybody together; raise your hands. The present. The present, that's a good answer; let's elaborate a little on that. What does that mean, the present? Initial conditions. And why are we limited in specifying the initial conditions? You're mixing things up. The initial conditions of a dynamical system — I'm talking classical Newtonian mechanics — are: my particle is at position x with velocity v. Is that a legitimate initial condition for the problem? It depends? The phase space of one particle — like I said, don't overcomplicate — is it fully specified by position and velocity? Good. Then, is it fully specified independently of the potential and of all the forces acting on the system? Right. So one thing is the initial condition; another thing is the laws. Good. And this is actually the answer: in order to predict efficiently, you have to be precise in the determination of your initial condition, and you have to have good knowledge of what the laws of your system are. If you know the initial position and velocity of your particle, and you know the potential forces, the non-conservative forces, all these things, and you know that Newton's laws hold, then you will be able to predict — sometimes only for a very short lapse of time, if your system is not integrable, but you will be able to look at least a little bit into the future. As soon as your ability to characterize your initial state degrades, or your knowledge of the laws of nature that regulate the motion of the particle degrades, your performance in prediction degrades too. How does atmospheric prediction work? By combining the two: there is an increasing demand for real-time data, and an increasing demand for more and more accurate computational models. If you lose one of these two legs, you fall to the ground. You might have a very accurate snapshot of the actual maps of temperature and pressure, et cetera, but if you don't have any computational tool and equations that describe the dynamics faithfully, you will predict nothing sensible. And the other way around: if you have a perfect model of what happens in the atmosphere, but it is not fed with actual initial conditions, you will be predicting something, but not what will happen here tomorrow. OK? So let's try to put these things on axes, because this applies to our problem as well. One axis will be, let's say generally, the precision with which we measure the environment; it is related to this step, to how the information about the state of the environment is converted into an actual observation. Let's go back to the example of the particle. A single particle lives in position and momentum space. Now, suppose you can observe only position: you will not be able to control this particle as efficiently as you would if you knew position and momentum. This is a limitation, because your subset of observations is not the full state space. And this is a real issue, because in most applications of artificial intelligence and reinforcement learning you don't observe the full state space. We don't observe the full set of possible configurations in Go.
We don't observe the whole world around us; we just get a subset of observables. So this axis, let's call it information about the environment. Again, it's very qualitative: the further up you go on this axis, the closer you are to knowing the real state of the environment, the true state, which is typically hidden from us. If you go all the way up, you have access to the real state, which is already a big thing. And on this other axis we put our knowledge of the environment, by which I mean knowledge of the laws that govern the environment. Do you see the sense of this categorization? Up here — wow, that's a beautiful place to live, where you have perfect knowledge of what the future holds for you; and if you combine this with perfect information about your current state, then you will be able to act optimally. This is a situation where — think about the stock market — you know in advance what the consequences of all your actions will be, and you know exactly the state of the full market. Then you are up here, and it is just a matter of computing. The decision-making problem in this region actually becomes a problem where there is nothing to learn: it is just about computing. Could you please speak up? Well, that's a situation in which you are sort of not so very far to the right on this axis, because your model is imperfect. Here I mean full knowledge of the dynamics of the system: it is as if God came and told you, if the stocks take this certain set of values, then in 10 days from now things will be like this. You do the job of computing, of turning the crank, but the laws are given. Sure. Is that clear? So up here is the realm of a very specific subset of problems, which go under the name of Markov decision processes. They live up here — yes, exactly, that's the limit, the upper right corner. We will show, and that's the first thing we will do, that in that case everything boils down to computing: there are algorithms by which we just compute the optimal behavior without having to learn. You don't have to learn, because you know how to predict the future with perfect precision. And this will be our starting point, because the lessons we learn there will be put to use to navigate the rest of this spectrum of possibilities. At the opposite end sits the full reinforcement learning problem, the one I aptly erased. There, you rely only on the previous history; that's all you have, things that have already happened, and you have to base your decisions on that, with no attempt whatsoever at looking into the future — and the observations might be very rough. You will still be able to learn; that's what you're saying — yes, sure, otherwise it would be hopeless. You can accumulate information and then use it for your purposes; that's the whole point. Of course, everything will be much slower. Not really in that sense: in this setup there will always be experiences that you have made and others that you have not, whereas up there you are already told which kinds of experiences are possible or not; you are told everything. Down here, instead: suppose your sequence of observations is a sequence of ones, OK? At any time, you cannot really be sure whether a zero will pop out at some point.
So you have to account for that possibility as well. Up there, instead, you are told that this is a one. And the drift along this axis is: here you don't know whether a zero will pop out, or maybe an emoji, or whatever; and there you are told, OK, it can be zeros or ones, or it can be any integer number. That's the way I see movement along this line: there is a notion of the quality of the information that you get. Then there are two other corners, which are also super interesting. Let's start with this one, down here. This is a situation in which you have knowledge of the laws of physics, but imperfect measurements. That's one class of examples, but I'm not very keen to go into it; I will give a classical one, a very simple example. Everybody knows what an optical trap is: an optical device which you can use to manipulate small particles. So there is a small particle, just fluctuating in a water medium, and you have this laser and you can move the particle around. Your task is to control this particle: I want to move it from A to B, that's the goal of my task. To do this I have to observe where the particle is, because if the particle is somewhere else, I need to adjust the position of my trap in order not to lose it. Now, the observation that you can make of that particle depends on the quality of your optics. If you have a PhD student squinting at it and saying, oh, the particle is about there, it's very low quality. If you have a super fast microscope, it's very high quality — though there will also be latency times and all kinds of problems. So there is a continuum in the quality of the observations you get: a continuum of differences between the position of the particle, which is just one part of its state space, and the actual observation that you get. Nonetheless, you want to control it, and you know the laws of motion. You just don't know something about the system, or you know it with some uncertainty, but you still want to control it. That's a perfectly legitimate task, and it's actually what happens in practice all the time. What is happening here is that you want to control what is called a hidden Markov process. There is some underlying Markov process — Newton's laws are a Markov process, a deterministic one — it is there, but the actual states of your system are hidden from you; you just get some observations out of it. So this corner goes under two names, which should be self-explanatory. One is partially observable Markov decision processes, or POMDPs. These are actually also controlled hidden Markov models; they are not usually known under this name, but that's what they are: there is an underlying Markov process which we want to control, which we want to make do things, but it is hidden from us. Of course there is, again, a way of going continuously from one regime to the other, depending on how good your observations are: if your observation is perfect, you jump immediately up to the top, and you know the laws of nature.
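To make the optical-trap corner a bit more concrete, here is a toy numerical sketch of a controlled hidden Markov process of that flavor: the true particle position is hidden and evolves under known "laws" (relaxation toward the trap plus thermal kicks), while the controller only sees a noisy measurement and moves the trap accordingly. All numbers and names are invented for illustration; this is not a model of any real apparatus.

```python
import random

# Toy controlled hidden Markov process: the true particle position x is hidden;
# the controller only receives a noisy observation y and chooses the trap center u.
random.seed(0)

x = 0.0          # hidden state: true particle position (unknown to the controller)
u = 0.0          # action: trap center chosen by the controller
target = 1.0     # goal: bring the particle to position 1.0
kappa, noise, meas_err = 0.1, 0.05, 0.2   # trap stiffness, thermal noise, optics quality

for t in range(200):
    y = x + random.gauss(0.0, meas_err)   # observation: noisy measurement, not the true state
    u = u + 0.5 * (target - y)            # naive feedback control based only on the observation
    # hidden dynamics (the "laws", assumed known): relaxation toward the trap + thermal kicks
    x = x + kappa * (u - x) + random.gauss(0.0, noise)

print(f"final true position: {x:.2f}, last observation: {y:.2f}")
```

If you increase meas_err, the quality of the observation degrades and so does the control, which is exactly the drift down this axis.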
Now there is another way of moving in this diagram, in the other direction, in which your knowledge of the environment decreases, so you have less and less power to predict; but you are very good at knowing where you are, you get very good information about the state of the environment. In the limit, it is perfect: you know which state you are in. You get exactly position and momentum for your particle, but you don't know Newton's laws. Sort of the lazy experimentalist's corner: you get super perfect observations, but no idea how to use them. You can do a lot in that corner as well, by very different techniques, because in this case you don't have what is called a model: in the reinforcement learning literature, these laws of nature, Newton's laws, are what is called a model of the environment — a model of how the environment evolves according to the actions that are taken and the state it was in. So this corner is where most of the initial theory of reinforcement learning was developed, and it is a corner where we will explore techniques; I will call it simply model-free RL, here with perfect observation, whereas down there the observation is imperfect. Where does supervised learning fit? It sits somewhat outside of this picture; there is little dynamics in the supervised case, right? That was the first comment I made. This diagram is about problems where you have to look into the future and you want to optimize something. So: we will be spending time here tomorrow, then here, and then here; and then we will try to wrap up with a general recipe for working anywhere in this region, which is actually where real algorithms operate, OK? That's the abstract path we will follow in the next lectures. Now, it is not a good idea to start with this today, because it requires a bit of technical work, so it's better to defer it to tomorrow. We still have 10 to 15 minutes, so let me give you some examples of these kinds of problems, which will be a bit like our workhorses for the rest of the course. On these examples we will do simple calculations and see the algorithms at work in cases where you can actually work them out on the blackboard, OK? These very simple examples are easily listed. The exercise here is to identify, within each particular task, all the objects and ideas that I've been discussing. The first class of problems that we will keep coming back to is actually a very large class of decision processes, which are by no means trivial, but whose conceptual structure as a Markov process is very simple. You had the tutorial yesterday about Markov processes, so you should already be at a good stage. When you define a Markov process, what do you need? What is the basic ingredient? A transition matrix — transitions between what and what? First states; then transitions between states, OK. So what is the simplest Markov process then? Sorry? A random walk? My friends, you are so complicated. Markov chains, yes, we're talking about Markov chains. What is the simplest Markov chain — how many states does it have? Two? Of course, because when you learn to count you say: two, three, four, five. What is the simplest Markov chain? One state. One state, OK. There is one state, and then the transition matrix is between what and what? The state into itself. But now suppose you could go there through different channels, different choices, different actions. For example: there is just one state, and you can take action 0 or you can take action 1. This is one way of drawing diagrams that is very useful in the context of Markov decision processes, so you are expected to get familiar with it, OK? So this is the state.
In a more general problem there will be many of these nodes, connected by transitions, and that will be the Markov chain you have in mind. But this one already contains something, right? There is an action. And, of course, the outcome of this action has to be that you come back to the state — that's for sure. But, as a combination of the initial state and the action, you might get back there through different paths. So what is happening here: I am in my state — I will be in that state forever, so let's forget about it — I take action 1; then the outcome of my action could be one of these two arrows, with some probability. Let's say with probability p I go onto this branch and I get a reward of plus 1 — this is the reward — and with probability 1 minus p I go onto this other branch and I get a reward of minus 1. OK? Let's add labels for symmetry: call this probability p1, and then we do the same for the other action: with probability p2 I come back with a reward of plus 1, and with probability 1 minus p2 I come back with a reward of minus 1. So this is a customary way of drawing these diagrams, and you can imagine having many of them; we will draw a more complicated one in a second. So: the states — just one state; this is a Markov chain with one state. How many transition probabilities? Several. My action space is these boxes, OK? For every state there is a set of allowed actions, here 0 or 1, which I can take; and if I take one of these actions, something will occur. As for the environment: here there is no measurement issue, no observation, because there is just one state; there can be no error in the observation. So this example simply wipes out the problem of observations. You just know that there is one state, you take one action, and the outcome of your action is random: sometimes it sends you over one branch, sometimes over the other, with given probabilities, and depending on what happens you get a positive or a negative reward. It's just like the rat pulling the lever: 95% of the time it gets the pellet, and 95% — sorry, 5% of course — of the time it gets a shock. (There was a question up there? No, you were just stretching; I thought someone was raising a hand.) And the other action leads to its outcomes with another probability. So, what is the policy here? Go only for — sorry, I didn't get what you said. Is the benefit or the cost of the action the same as the reward? Yes, that is the goal: the goal is to maximize the benefit, which we didn't define yet for this example, but it will be the discounted sum of all the rewards that you get, as usual. The policy here is: what probability should I give to these two actions? There will be a probability pi of taking action 1 and a probability pi of taking action 2 — sorry, action 0 — and these two must sum to 1, because we have to do something. One half, one half is one policy. Is it the best policy? OK, did you understand the suggestion? The suggestion is that the best policy is to act proportionally to the rewards — is that right? Am I correctly interpreting what you're saying? Proportional to the probability of getting the reward, good. This is something which exists in the literature. Now the second question is: does everybody agree that this is the best strategy? It depends on your time horizon.
This is one thing. Another suggestion is to assign all the probability to one action: that would be a deterministic policy, in opposition to the one defined earlier, which is random. Which branch? OK, the one with the biggest p: if p1 is larger than p2, you go for action 1, and vice versa. Any other suggestions? So, who votes for the proportional strategy? Who votes for the all-on-one strategy? That's what we'll do tomorrow. Discuss, think, come to a consensus.
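Since the homework is to think about this, here is a small simulation you can play with. It is only a sketch: the reward structure is the diagram above (one state, two actions, plus or minus 1 rewards), the values of p1 and p2 are arbitrary, and the "proportional" policy is implemented as picking action 1 with probability p1/(p1+p2), which is one way to read the suggestion made in class.

```python
import random

# One state, two actions. Action 1 gives +1 with probability p1, else -1;
# action 0 gives +1 with probability p2, else -1. (p1, p2 chosen arbitrarily.)
random.seed(1)
p = {1: 0.8, 0: 0.3}

def reward(action):
    """Stochastic outcome of one pull: +1 with probability p[action], otherwise -1."""
    return +1 if random.random() < p[action] else -1

def proportional():
    """'Proportional' policy: pick action 1 with probability p1/(p1+p2)."""
    return 1 if random.random() < p[1] / (p[1] + p[0]) else 0

def all_on_one():
    """'All-on-one' (deterministic) policy: always pick the action with larger success probability."""
    return 1 if p[1] >= p[0] else 0

def average_reward(policy, steps=100_000):
    """Empirical average reward per step under the given policy."""
    return sum(reward(policy()) for _ in range(steps)) / steps

print("proportional:", average_reward(proportional))  # roughly (p1*(2p1-1) + p2*(2p2-1)) / (p1+p2)
print("all-on-one:  ", average_reward(all_on_one))    # roughly 2*max(p1, p2) - 1
```

Running it for various values of p1 and p2 is a good way to form an opinion before tomorrow's discussion.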