So welcome everybody. This is our first lecture of the course, and today it will be a quite gentle lecture, in that we will mainly discuss concepts and ideas and we'll leave the formalism for tomorrow. So the basic message for today is to understand what reinforcement learning is and what kind of problems it wants to address. In this first half of the lecture, we will just go through some examples in order to give you an idea of what the playground for reinforcement learning is, what kind of problems it wants to address, and why it is different from other branches of machine learning with which you might be more familiar. So like I said, we're just going to go through simple examples. These examples are useful because they will pop up repeatedly during the course. So it's important that you familiarize yourself with these ideas from the very beginning, because they will be our workhorses in what follows. Every time we go back to some specific example, we will connect it to the examples that I'm going to give you now. So the first example actually goes under the name of a category of problems which are rather popular in reinforcement learning, called multi-armed bandits. Here already there is a lot, because this name doesn't mean anything in itself, so it's worth starting with an explanation of what it is. Do you know what a one-armed bandit is? Have you ever heard this expression, a one-armed bandit? Not personally. No, okay. It's a slot machine. It's a slot machine, exactly. So it's jargon for a slot machine. It's actually very old jargon, because once upon a time, when I was your age, slot machines were mechanical objects. There was a lever, and you had to pull this lever, and then there was a grinding sound, and most of the time nothing came out of the slot machine. This lever was called the arm of the slot machine, and since they had one single arm, they are one-armed. But why are they bandits? Well, because they get away with your money, okay? You put the money in and you don't get the money back. So that's the origin of the jargon. What are multi-armed bandits? Well, you have to imagine a situation where you have several slot machines. Let's start simply with two slot machines, one beside the other. And in the simplest setting, think about a problem in which one of the two slot machines is better than the other. In what sense? It's better in the sense that if you pull the arm, you have a larger probability of winning for one of the two machines. And the problem in general is to understand which is the best machine while playing. But most importantly, the final goal is not to learn which machine is best. The final goal is to get money. So the objective in this case is not to collect information. Collecting information is useful, because without information you don't know which one to pick, but the primary goal is to get money out of it. Okay, so in order to fix ideas on this very simple problem, let's consider a simplified version of it, which is called a two-armed Bernoulli bandit. This is a fancy name for a situation in which you have two coins, coin A and coin B. So I take two coins out of my pockets, which I fabricated in such a way that one of them has a certain probability of giving heads and the other one has another probability of giving heads.
Okay, and you as a player can ask me which one of the two you want me to flip at each round. You say coin A, which is this one, I flip it, and if it's heads, you win one unit of money. If it's tails, you don't win anything, or you lose, whatever, it doesn't really matter. And if you ask me to pull B, then I will flip this other one. So at every round you can ask me to pull just one of them, and the goal is, after a certain number of trials, which we can fix in advance, say capital T trials, to get on average the biggest number of wins. Okay, so why is this problem difficult, and why is it interesting? Well, obviously, if I told you the biases of the coins, this would be a very simple problem. Let's say that the observations y that you get are either zeros or ones, zero corresponding to tails and one corresponding to heads. Then mu_A, the expectation of y under the probability distribution of coin A, is basically the probability that coin A gives heads, and similarly mu_B is the expectation of y under the probability distribution of coin B. So if I told you the biases mu_A and mu_B, then the problem is simple: you just take the maximum between these two quantities and always play the coin with the maximum probability. Fair enough. But the problem in general is that you don't know that in advance. Okay, so let's suppose that for this problem you want to follow a strategy inspired by the simplest intuition possible. I need to collect information, and I need to make decisions. So let's imagine that we split this problem into two phases: in the first phase we explore, and in the second phase we commit. What does that mean? Suppose we have a certain number of rounds, capital T; this is the total number of coin flips that we will make. And then we decide that for the first n rounds, with n smaller than T (or smaller or equal, but typically smaller), we do exploration, which means we flip the coins, say, fifty-fifty. We either do round-robin, this coin, then this coin, then this coin again, or we take A and B at random. So for example, A, B, A, B, A, B. If n is even, you will have visited the first coin n/2 times and the second coin n/2 times. And given this series of numbers, you can make a table: for coin A, going from one to n/2, and for coin B the same, you keep track of all the observations you got. So for instance, this was heads, heads, tails, tails, tails, heads, heads, and so on, and same here. So the idea is that you start here, and you see a tails, a heads for the first coin, and then you go down. Can't you see the pointer? What you're writing is covered by the bar with the pens and the colors of the tablet. Ah, so I should move it like this. Okay. So this is the first sequence and this is the second sequence, and you're alternating between the two coins, going up and down. And this is what you observe.
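To make this concrete, here is a minimal sketch in Python of the exploration phase just described. The biases mu_a and mu_b are of course unknown to the player; they appear here only because we are simulating the coins ourselves.

```python
import random

def explore(mu_a, mu_b, n):
    """Round-robin exploration: flip coin A and coin B alternately, n rounds total.
    Returns the two lists of 0/1 observations (1 = heads = win)."""
    obs_a, obs_b = [], []
    for t in range(n):
        if t % 2 == 0:
            obs_a.append(1 if random.random() < mu_a else 0)
        else:
            obs_b.append(1 if random.random() < mu_b else 0)
    return obs_a, obs_b
```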
And then at the end of this exploration procedure, you have collected a certain set of numbers from which you can extract some estimates. So you can estimate the bias of coin A by just dividing the total number of heads by n/2; this is the empirical average that you compute. And the same thing you can do for mu_B. This is simple enough. And then, after that, from time n+1 to capital T, in the second phase, you commit. So the first phase was times from 0 to n, and from times n+1 to capital T you commit. What is committing? Well, you say, okay, I've collected enough information, so I will pick the... Professor, what was n here? Sorry, n was the number of rounds in which you do exploration. So in the first part, you just alternate between the two coins: you ask me coin A, coin B, coin A, coin B, and you do that overall n times. And you know that the total number of times that you can flip the coins is capital T, which is larger. So in the first phase, you flip the coins alternately, and in the second phase, you stick to the coin which was performing best after the first n rounds. Is this clear? Yes, thank you. This algorithm is called explore-then-commit, because you first explore, without selecting anything in particular, and then you commit to one of the two options. This is a typical approach that you could see coming from data science, right? First collect the data, then make a decision, and then use this decision on the rest of the data that are coming. One thing that is peculiar to reinforcement learning is that you don't want to do this. You want to do something which continuously interacts with the data: as you receive one data point, you decide something, and then you go on, and you enter a closed loop in which you receive information, update your decisions, and so on and so forth. So what is the problem with this explore-then-commit algorithm? The question is: what is the best n, given T? When should I stop exploration? Do you have any ideas how to identify it? So suppose I tell you we're going to flip the coins overall 1000 times, so my capital T is 1000. I guess when the estimates stabilize. Okay, that's good. Quantitatively, how do you construct a criterion for this? This is a good point. Okay, so let's see what is going to happen by making a little graph. Let's say this is time, and this is our final horizon. This is another notion which will come up often in the future: T is the horizon of our decision process. And we want to put some n here, and we want to put it in the best position. So let's suppose that we plot here our estimates for the two coins. Every time we flip one of the coins, these estimates are updated; the empirical mean is continuously updated. If we go infinitely far, eventually they will converge asymptotically to two values, right? These are the true mu_A and mu_B. But we cannot see them; that would require an infinite number of trials. We would get there with certainty by the law of large numbers, but with a finite sample there will be errors. So at the beginning, our estimated values, of course, will just be zeros or ones, because at the first step the empirical average is dominated by the first result.
So you will start somewhere. For instance, in our sequence, if I just consider n equal to one, I limit myself to the first flip, and this would mean that the estimate of mu_A starts from one. Okay, let's use colors since we have them. I will put in blue the sequence of estimates, which starts from one, and then it moves around and fluctuates, and somehow stabilizes around this value here. At the beginning the fluctuations are larger, of course, and then they get smaller. And do you remember how these fluctuations go down? What is the standard deviation of the empirical mean as an estimate of a coin flip? It's like one over the square root of n. Exactly. It goes like the standard deviation of the single variable, which is Bernoulli, divided by the square root of the number of trials: essentially, this is the square root of mu_A(1 - mu_A), which is the standard deviation of coin A, divided by the square root of the number of times coin A was flipped. So if the time is t, that number will be t/2; it doesn't really matter. The important thing is that it decays like the inverse of the square root of time. So this is what I wanted to depict by these curves that approach the final level. And then what happens for coin B? Well, it's pretty similar. It starts out at zero in this case, just because I put a zero here as the first term of the sequence, and then it moves up and starts leveling off towards its final level. And same here. So according to this, when do you think is a good time to stop? Well, intuitively, the good time to stop is when... okay, I'm going to challenge my own drawing abilities. Let's take a section here, at time n, and I'm going to plot the distributions of the estimates for both coins at the time when I stop. According to what I told you, and I hope you're familiar with this, essentially by the central limit theorem, if n is sufficiently large, the distribution of the estimate of mu_A approaches a Gaussian centered at the true mu_A, with a standard deviation that goes like one over the square root of n. And same here for the distribution of the estimate of mu_B, which will be centered here, and again its typical deviation from the mean goes like one over the square root of n. So a good n is when these two distributions are separated enough, when they don't overlap too much. So maybe here n is too small, but if I went further on... Professor, here you started with the two same coins, but the results are so different. What do you mean, two same coins? Each coin has its own bias; these are two coins with different biases. If they were the same, there would be no decision to make. Maybe they are the same and you don't know it, you have to discover it, but that's something I'm going to address in a second. Okay. And because of their bias, the curves for A and B are starting from different points at time zero? They start at different points only because of the first item in the sequence. If the first toss of coin B gave one, they would start at the same point. What is important is not where they start, but where they end up. Right. Thank you. Okay.
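As a quick numerical check of this one-over-square-root-of-n behavior, here is a minimal sketch; the bias 0.6 and the sample sizes are arbitrary choices for illustration.

```python
import random
import statistics

def empirical_mean_std(mu, n, repeats=10000):
    """Standard deviation of the empirical mean of n Bernoulli(mu) flips,
    estimated over many repeated experiments."""
    means = [sum(random.random() < mu for _ in range(n)) / n
             for _ in range(repeats)]
    return statistics.stdev(means)

# Expect roughly sqrt(mu*(1-mu))/sqrt(n): quadrupling n should halve the spread.
for n in (25, 100, 400):
    print(n, round(empirical_mean_std(0.6, n), 4))
```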
So now let's move to a situation in which you have the same picture as here, but since you have played many more times, these distributions are well separated. These are the distributions of the estimates. Then you could say that this n is okay, whereas the earlier one isn't. So in theory, the idea would be: let's keep track of the empirical means and variances, and when there is enough separation, we stop. Now, the problem is that this notion of enough separation depends on this quantity, the difference between mu_A and mu_B. In this case mu_A is larger than mu_B, so we can write it without absolute values; we call this quantity the gap. So you see that here there is some sort of circular thinking: I should stop when the two distributions are far enough apart with respect to a gap that I don't know yet. But if I knew the gap, I would know which decision to make. So this explore-then-commit kind of algorithm has one issue: it requires previous knowledge about what the distributions would be. And for each trial, you can compute the mean of the two? Right, at each trial the only thing you receive is either a zero or a one, so you can update your average. So you can compute your mu at every trial? What do you mean? At every trial you just get a one or a zero, so your estimated mu is just something you get out of a sequence, and it approaches the true value as the sequence grows. I'm not sure I'm understanding what you're actually asking. The true mu, the real bias, is a quantity that you don't know; it is unknown. The estimate is what you can compute, but it depends on the particular sequence. Suppose that your coin is fair, but then by chance you get a large number of heads: your estimate will be very different from the real value. Yeah. All right. Sure. Why do we only consider the gap as the difference of the means? Because I think it's possible that the means are close but the distributions have a big overlap, which depends also on the variance. You're totally right, it depends also on the variances. But these quantities are nevertheless bounded by one over four, the maximal variance of a Bernoulli variable, so it's a secondary effect. What is most important is that if you miss the right choice of the number of exploration rounds, you can pay a very big price for the wrong choice. It turns out that if you do the analysis of this algorithm, you find that there is a best choice of n, which is approximately one over the gap squared times the logarithm of the total number of rounds. And if you choose your n smaller than this, which means that you start your commit phase too early, you pay a big price for that. But if you choose your n too large, that is, if you explore too much, you also pay a big price. So there's a relatively narrow window of choices for your exploration length in this algorithm, and it depends crucially on the gap. If the gap is very small, you will need much longer exploration times, which makes sense. But if your gap is large, the problem is easier and you can decide earlier. So these kinds of algorithms, like explore-then-commit, are difficult to tune, and they are not adaptive.
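Here is a minimal sketch of the whole explore-then-commit procedure, assuming the explore helper from before; in a simulation we know mu_a and mu_b, while the player only sees the outcomes.

```python
def explore_then_commit(mu_a, mu_b, T, n):
    """Explore-then-commit on a two-armed Bernoulli bandit.
    First n rounds: alternate the two coins (assumes n >= 2).
    Remaining T - n rounds: play the empirically better coin.
    Returns the total number of wins."""
    obs_a, obs_b = explore(mu_a, mu_b, n)
    wins = sum(obs_a) + sum(obs_b)
    mu_a_hat = sum(obs_a) / len(obs_a)   # empirical bias of coin A
    mu_b_hat = sum(obs_b) / len(obs_b)   # empirical bias of coin B
    committed = mu_a if mu_a_hat >= mu_b_hat else mu_b
    for _ in range(T - n):
        wins += random.random() < committed
    return wins
```

Try running this with a small gap, say 0.55 against 0.5, for different values of n: too small an n often commits to the wrong coin, while too large an n wastes rounds on the inferior one.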
They are not adaptive because, if you now want to double the length, suppose that at some point of the game I tell you, okay, we're not going to play 1000 times, we're going to play 2000 times, then you will have to modify the number of exploration rounds. Maybe you didn't explore enough. So this is extremely cumbersome, and it's not compliant with the spirit of reinforcement learning, which is, like I told you before, to explore and exploit at the same time. So we are looking for solutions to this problem which do not separate exploration and exploitation into two different phases, but where the two happen at the same time: while gathering information about the environment, you simultaneously improve your decision making. This is very important, especially in situations where getting data is costly, like in robotics. In those situations, you don't want to do a lot of exploration first and optimization later; you want to intertwine the two processes. Robotics is not the only motivation. The second motivation is that, in the end, we would like to use these kinds of schemes not only for computer science applications or robotics, but also, for instance, as a tool to understand human and animal behavior, because reinforcement learning also has connections with neuroscience and psychology. Can I ask something? Sure. In this algorithm, do you decide your number n before starting the whole process? Sorry, can you repeat that? I'm not sure I got it exactly. In this algorithm, you decide n before starting the whole process? Yes. In order to decide n, you need to know capital T, which is known. But then, in order to find the optimal n, you should also know the gap, which is not known. So this algorithm will work if I tell you, for instance, that mu_A minus mu_B is delta, or mu_B minus mu_A is delta; you just have to decide which one is better, but you need information about how different they are. In that simplified situation, you can use this algorithm. Otherwise, if you don't know delta, you will have to estimate it, and the algorithm will have to do two things at the same time: find out what the best n is, together with an estimate of delta. The reason I'm not explaining the improved version of explore-then-commit is the one I was telling you before: this is not what we really want. We don't want to deal with situations where we have to first collect data and then optimize, and the other algorithms that we will use in the following do much better than this one. So this is just to show you how a naive approach can be extremely fragile and very difficult to implement. Any other questions on this? Yes, just one. Can you repeat again what the role of the time horizon, the capital T, is? Okay. The definition is just that you decide beforehand that your game stops after capital T trials. Of course, you will do very different things depending on the horizon. Suppose that I tell you we are going to play just twice: there are going to be just two coin flips. Then your strategy is rather obvious: you flip once at random, because you don't know anything, and then the only information you have is that either you got a one or a zero.
And then you might make some argument: okay, if I got a one, then it's more likely that this coin is better than the other. Of course, it's a very crude inference, but that's the data you have. So with a very short horizon, you choose this kind of strategy, which is very exploitative, in the sense that you exploit the current information to the maximum. Since you can only play two times, let's go back to the sequence I created here and assume that we are playing just twice. At the first trial, which we did at random, we picked either one or the other, and we get, for instance, A equals one. Then, given only this information, you would choose to play A again as the second flip, because that's the only information you have. I don't know anything about the other coin, and why should I care about exploring what the other one is doing if I'm dying after the next step? Okay, clear? Clear, thank you. The longer the horizon, the more you want to invest in exploration. Crystal clear. Thank you. Okay, so this was just a glimpse. Don't be worried if you don't get all the details, because we will be going over these multi-armed bandits many, many times, with better algorithms and better ideas. This is just to give you a feeling for the kind of problems that reinforcement learning wants to deal with. So before we take the break, I just wanted to tell you a couple of things, because this really seems like an academic game, something contrived by the mind of an obsessed statistical mathematician who wanted to formulate his own toy model. It is actually an excellent toy model, because it's absolutely non-trivial, yet it's simple to define; it's a sort of prototypical idea of what a model should be: simple to define, rich in behavior and complexity. But I wanted to tell you that there's more to this. For instance, multi-armed bandits are connected to much more serious issues, like ethical clinical trials. I don't know if you know anything about how clinical trials are conducted, but they are conducted very much in the spirit of explore-then-commit algorithms. So, for instance, suppose that your options A and B are two different kinds of drugs, or maybe A is a drug and B is a placebo. A placebo is something which does not have any physiological effect, but it's given as a control. So how do clinical trials work? Well, you first choose a certain number for your population beforehand; that's the capital T. And then what you do is, in fact, explore all the way to capital T. You don't do any commitment until you have finished the trial. That's why some trials last for years, especially when you're dealing with rare diseases: you have to wait for data to come in, a new patient arrives, you give the treatment, and then you observe the outcomes over a long time. So it's not just simple coin flipping, but the class of problems is the same. And if you think of it this way, you realize that the current way of doing clinical trials, through this double-blind approach, is highly unethical. Why is it unethical? Because you don't switch to an exploitative behavior when you, in fact, could already have the information for it. Suppose that you're very lucky, and it turns out that your drug A is extremely effective against the placebo, so there's a strong positive effect.
But according to the current protocols, you have to run through all of your trials before you can stop and decide. By doing that, of course, you are being unethical, because you're not giving the drug to people who could have received it, both within the trial and by stopping earlier and moving to deployment on the full population. So multi-armed bandit algorithms could be used, even though they haven't been yet, to improve clinical trials. This is a proposal that has been put forth several times over the years, but it faces some resistance from the medical community, for reasons which are not entirely scientific. Sorry about this dark example. Can I ask a question? Sure. It's very practical, not a theoretical question: how do you switch from one treatment to another in actual clinical trials? From the placebo to the actual drug, say. You do full exploration from the beginning: as a patient comes in, there's a coin flip which decides whether to give the drug or the placebo, and this is assigned with neither the patient nor the doctor knowing. That's why it's called double-blind. Okay, but you don't change treatment for a single patient; you just decide which treatment each new one gets? You don't even update what you do: everyone that comes in is just assigned to class A or class B, and you remain oblivious to the outcomes until the end. Okay, I had assumed that you were changing the treatment for the same patient. No, a patient comes in, gets assigned, and is going to get that treatment until the end of the trial, of course. That makes sense. Thank you. No problem. So this is a more practical motivation to study these problems, which makes them also more interesting. And there's a second one, which is also interesting and funny and lighter, so we can close with something which is not so dark, given the times we live in: the problem of recommendations. I imagine that some of you might have a subscription to Netflix or Hulu or Amazon Prime. Every time you open a page, you get a screen of possible recommendations for you to click on. You may not realize this, but this is a bandit problem, only with the roles reversed, in the sense that you are the coins, so to speak: you provide the feedback to the player, and the player is Netflix, Amazon, whatever. The player decides to offer you a set of opportunities and you pick one of them. So every recommendation is like a slot machine, and you decide which slot machine to pick. And then, given the feedback, there is an algorithm which decides how to bias the proposals of possible choices in order to get you to click the most; the clicks are what amounts to winning. Of course, the algorithm doesn't work only based on clicks; it also works on other contextual information that you provide, all the information about the kinds of movies that you watch, or information that comes from other external sources. So it's not, strictly speaking, the kind of bandit problem that we've been discussing here, but a sort of improved version of it, which is called contextual bandits. Basically, contextual bandits are decision problems like before, only that you receive some side information on top. And contextual bandits are sort of a step between the simple bandit problem and the full reinforcement learning problem that we will address. Okay, so we will be falling back on this again later on.
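Just to fix the structure of this recommendation loop in your mind, here is a hypothetical skeleton; all the names here (recommender, get_context, get_click) are placeholders for illustration, not any real API.

```python
def recommendation_loop(recommender, get_context, get_click, T):
    """Skeleton of a contextual bandit round: observe side information,
    pick an arm (a recommendation), observe a click (the reward)."""
    history = []
    for t in range(T):
        context = get_context(t)              # user features, time of day, ...
        arm = recommender(context, history)   # choose what to recommend
        reward = get_click(context, arm)      # 1 if clicked, 0 otherwise
        history.append((context, arm, reward))
    return history
```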
So, just to tell you that, as simple as they are, these multi-armed bandit problems have implications for very diverse fields, from computer science to medicine to decision making and many other applications. Okay, so if you have any questions, ask them now; otherwise I think it's time we take a deserved break. Let's meet again at ten past ten. Is that okay? Perfect. Thank you. Enjoy your break. So, for the second part of today's lecture, we will discuss two other examples, which have a very different spirit from the bandit problem. The first one is called the cart-pole. What is the cart-pole? Well, it's a cart with a pole on top; not particularly complicated to explain. So this is a cart, and these are wheels. On top of the cart there is a hinge, and on this hinge there is a pole, a rigid stick. So it's a mechanical system, and your mechanical system lives within a certain domain; let's just say that there are two walls, here and there. Clearly, if you have a pole sticking up on a hinge, this is an unstable configuration, mechanically speaking, because if you don't apply any force, a small perturbation will make it fall to one side or the other. In this one-dimensional version, it can only fall to the sides. The goal of this problem is to maximize the time during which the pole doesn't fall: it should be in any position which is not flat, for the longest possible time. So suppose again that you have a time horizon capital T. Let's introduce the angle theta with the horizontal, and let's do things properly: define a time tau, which is the largest time t such that theta, at all times up to t, is strictly larger than zero and strictly smaller than pi. You start with theta equal to pi halves, and as long as the pole is not flat on either side, you keep counting time; when it falls, you stop your episode. Or, if you go all the way to capital T, then you're happy, because it means you went until the end without falling. And of course, you cannot get out of the boundaries. The thing that you can do to this system is, of course, apply a force to the cart. The idea is that if your pole is falling to the right and you give an acceleration, you can put it back in the vertical position; that's what you do when you try to balance an unstable object: you move around in order not to let it fall. So the idea is to find the best way of applying forces, where of course the force is bounded, say, by a maximal force in both directions; you put some sensible constraints on your problem, and you ask what the best protocol is. So how would you go about it? Does anyone have a suggestion on how to solve this problem? And does anyone have an idea whether this connects to any other problem, in other disciplines, that you have ever heard about? Does this ring a bell? For me, I think we have to describe the dynamics of the cart and, using that dynamics, find how theta evolves in time. Okay, very good, I'm going to distill that. It's like a spring problem... I mean, it's like the problem of an inverted pendulum. Right, it's certainly a problem of very simple classical mechanics.
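To make the objective concrete, here is a minimal sketch of how one could simulate an episode and measure tau. The dynamics function is a stand-in for whatever model of the system you assume (it is not the true cart-pole equations), and the Euler integration step is just the simplest choice.

```python
import math

def run_episode(policy, dynamics, dt=0.02, T=10.0, f_max=10.0):
    """Simulate one cart-pole episode and return tau, the time the pole stays up.
    State is (x, theta, x_dot, theta_dot), with theta measured from the
    horizontal, so the pole is upright at theta = pi/2 and flat at 0 or pi.
    `dynamics(state, force)` must return the time derivative of the state."""
    state = (0.0, math.pi / 2, 0.0, 0.0)       # start upright and at rest
    t = 0.0
    while t < T:
        force = max(-f_max, min(f_max, policy(state, t)))  # bounded control
        deriv = dynamics(state, force)
        state = tuple(s + dt * d for s, d in zip(state, deriv))  # Euler step
        t += dt
        if not (0.0 < state[1] < math.pi):     # the pole has fallen flat
            break
    return t
```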
I was looking for some sort of higher-level classification of this problem. Well, depending on the angle and the acceleration of the object, you have to take an action, whether to go right or left, with a certain intensity, obviously. Exactly, that's a formulation of what you're looking for. What I was trying to evoke in your minds is that this is what is called, in engineering, a control problem. You have a system and you want to control it. Now, reinforcement learning and control have a clear overlap region, and as a matter of fact, most control problems fall within reinforcement learning, which in some situations can be thought of as a glorified version of control theory, but it's much more than that. The reason I'm bringing up this example is what happens when you try to formalize this problem. One possibility to go about it, and now I'm translating into my own mathematical language what Boris was saying... yes, did you have a question? Please. I just thought of something: maybe in robotics, when we make a robot and we want it to stop somewhere, or not fall off, let's say, a staircase, it needs to know where to stop. Maybe that problem is similar. That is also connected, of course: in order to control a system, you need to have some information about its state and the environmental state. This is certainly very important, and we will discuss it in great detail in the following. But let me go a little bit along the lines of what Boris was suggesting. One possibility is to describe the dynamics of this system somehow, and this somehow could be, for instance, Newton's laws of motion. So we might want to describe this as a physical system and write down the equations of motion for it. The first step would be to identify the degrees of freedom. In this particular system, the degrees of freedom could be the horizontal position of the cart, the angle, and the associated momenta, which are just the velocities. So the degrees of freedom of the simplest description of this model would be x, theta, x-dot, theta-dot, where the dots denote the time derivatives. Let's call this set of variables q, the generalized coordinates; then the laws of motion would read something like q-dot equals some function f of q and of the external forces that we apply, which we call u; these are the controls. So, like you were correctly saying, the idea is that we want to choose the forces in such a way that the solution of this differential equation realizes the maximum possible survival time of my system before it falls sideways. This way of formulating the problem is essentially optimal control theory in engineering; it's exactly that. Why is this problematic? Why would you want something more than that? Can you identify possible problems in deploying this strategy? Just to clarify things, let's assume that you have infinite computational power, so you can compute to arbitrary precision what the outcome of this system will be. I'm looking for something more fundamental: what kind of problems do you think can arise if you want to follow this approach?
This is quite important, because we will be discussing these things a lot in the following, so it's useful for me to have input. Is it a deterministic approach, the one we just saw? Very good. There is also a version, stochastic optimal control, which takes noise into account. You're totally right that this is a shortcoming of classical optimal control theory, but it has been overcome: suppose that I replace this f here by some stochastic differential equation; I can then rephrase everything in terms of probabilities of being able to stay up, and so on and so forth. So you're certainly hitting a good point, and reinforcement learning is an essentially stochastic approach, so we will take care of that. But here I'm looking for something else. I think it's because the feedback is constant? No, it's not really constant: this u actually becomes something which depends on time and on q; it's a function. So you have this ability: given the state you are in, given the degrees of freedom, given the current configuration, you decide what to do, also depending on time. You're allowed to do that. No, actually, it's way simpler than that, and I'll give you a hint: in order to do this, we've been making lots of assumptions. And this is something which goes beyond reinforcement learning and beyond any particular problem. Is it the fact that we should know everything about the environment? Exactly, that's the point. We're making the biggest possible assumption: that we know everything. I presented you this model and you more or less took it for granted; it sounded reasonable. But if you look closely at it, even from an engineering viewpoint, you could say: okay, but this is not really a rigid object, is it? So why should I limit myself to the description of only these degrees of freedom? And what about the wheels on the ground: is there friction, is there any kind of dissipation? What about the hinge? And what happens when I hit the wall? So I cooked up a model, and it looked reasonable, but it's a model, and the result of my optimization will be just as good as the model is. That's one thing: control theory belongs to the class of model-based approaches, which can be technically simple or difficult depending on the model, depending on whether it's stochastic or not, and we will of course discuss these model-based approaches a lot. The second thing is the following. Once we accept that this system may not be as simple as it seems, for instance the cart is made of some soft material, or the wheels are soft, then specifying the degrees of freedom is a problem in itself. So, formally speaking, in general we do not know q and we do not know f: we don't have access to all the degrees of freedom of the system, and we do not have access to the laws of motion that describe it. We can only make assumptions. So a big question that we will address is: can we learn to control a system like this without knowing what all the degrees of freedom are, or having access to only a limited subset of them? That is, can we address partial observability, and can we address model-free problems? Can we control the cart without knowing exactly all its degrees of freedom, and without knowing exactly its equations of motion? So what is your guess: can we do that? Yes, I guess.
Yes, of course, because every time you ride a bike or drive a car, you are demonstrating exactly this principle. You don't need to know exactly how the car or the bike interacts with every detail of the environment, which may be changing all the time. So the higher goal of reinforcement learning is to come up with algorithms and approaches for automatic control, that is, for autonomous systems, which adapt to various environments without requiring knowledge of the full state space of the system, which might be enormous, and without requiring every specific detail about the system. Let's be clear on this: if you can observe more, that cannot harm, even if it can distract. This is something which is very important in learning. Having large observation spaces cannot be bad from the viewpoint of information, but from the viewpoint of learning it can distract the learner, so it might take more time to learn. And on the other side, if you have a good model of what happens in the environment, you should use it, because it will improve your learning. So a key thread in all these lectures is that we will move from situations where you have a good model, expanding the ideas of optimal control theory to a wider class of problems, and then we will add concepts step by step: what happens if I have a good model but only partial observability? What happens if I don't have any model but have full observability? To culminate eventually in the situation where we try to control a system which we observe only partially and for which we don't have a model of the environment. Which is the only hope we have if we want to control, for instance, agents in video games. I mention this because it is one of the greatest recent successes of reinforcement learning: very complex environments with many agents, largely unobservable, and of course lacking basically any kind of model of the agents' behavior. I have a question. Please. It's a little bit philosophical. Oh, good, I like that. Does this imply that reinforcement learning doesn't add much to our knowledge from an epistemological point of view, or to our information in general? Okay, you have to give me something more; can you please spell out the question in more detail? Yeah. In this case, we don't need to know the degrees of freedom, or the forces, or anything about the environment; we can just learn from the environment. So does this mean we are not gaining any information about the system itself when we are using reinforcement learning? That's correct. Okay. So this is a very useful and very important question, which brings me... okay, can I ask you just to wait? Because I'm going to answer your question in great detail with the final slide, which is the introduction to the future lectures. It's absolutely on point. So, do you have any specific questions about this cart-and-pole system, or does this outline it well enough? Okay. The third example that I want to discuss with you is the problem of the cleaning robot. Colloquially, we could call it a Roomba, but not quite, because that's a trademark, so we're not going to write it. The idea is that there is one of these pleasant-looking robots, which is a disk, whose task is to clean some room. This robot is equipped with sensors, which can detect obstacles; it has motors to displace itself and to rotate.
It has a suction mechanism to get the dust off the ground, or it can even suck up water. And it has a battery, of course, so at some point it has to go to some charging device, here. So this is one of many instances of a class of problems which are called navigation problems. Again, this is another instance of a control problem. You want to send the device where the dust is, in order to cover the ground as uniformly as possible; to have some memory of where it has been visiting lately, because it's very unlikely that there will be dust where it just passed by in the last half hour. You need to have a map of the environment, which is a model, by the way. But this environment is changing, so you have to have some sort of dynamic map of the environment in order to know where to go. You see, all of these are problems that robots have to face, but also animals, humans and non-human, and even microorganisms, actually, navigate in space. So it's a huge class of problems: basically, goal-directed movement. And clearly, these kinds of problems must also fall, in a sense, inside the class of problems dealt with by reinforcement learning. So one of the first things that we will do is to develop a language, a mathematical language, that can encompass all these different phenomena. We don't want to use a technique for bandits that doesn't apply to dynamical control problems, that doesn't apply to robots; we want to find an overarching mathematical description that we can specialize to different situations. This is the purpose of the first lectures. Once we have that, we will see how to address specific situations. So, coming back to the question by Idris: yes, thinking about reinforcement learning is thinking about epistemology. It's thinking about how we as intelligent agents, and machines that we want to be intelligent, relate to the environment. So I like to represent this problem of interaction with the environment with a diagram, which will also serve us as a map for our future steps. Remember, from the example of the cart-pole, there are basically two axes of knowledge, which are represented here. One of them, let's put it on the horizontal axis, is our knowledge of the model: how much we know, and how confident we are, about the laws of motion of our system; how much we know about the parameters of the cart, what laws of motion describe it, whether there is friction or not, all these details. Then on the other axis I put what I would call observability: given a certain system at hand, how much can I observe of it? What are the degrees of freedom that are observable and can be put to use for control? What we will do in the following lectures, basically, is move around this map, so it's useful to start outlining these aspects now. This also highlights the connection between model-based and model-free approaches, and between perfect and imperfect information about the state of the system. So, in the first set of lectures, we will be sitting up here, in the top right corner. What does sitting there mean? Well, it's the very happy situation where you have a model and you can see all the degrees of freedom. For our cart-pole, it's basically the situation where we have a model in which we specify these four degrees of freedom.
And we specify the dynamics through certain simple Newton's laws of motion, and we put in all the parameters that we need: the masses, the moments of inertia, everything. Everything is known; we describe everything, and this is a good description of the system. We know the degrees of freedom and we know the dynamics. So this is basically a problem of optimal control. But now we want to expand to a general situation which does not include only mechanical systems, and the way of doing this is through a theoretical device which goes under the name of Markov decision processes. Markov decision processes are essentially a general framework which describes decisions in a situation where you have a model of the system and you know what the true states of the system are. And true states means that they obey the Markov property. The Markov property, I remind you, is the fact that if you know what the state is, you don't need to know what happened in the past in order to predict the future; in formulas, P(s_{t+1} | s_t, s_{t-1}, ..., s_0) = P(s_{t+1} | s_t). The knowledge of the current state is sufficient to predict what will happen next. Because, as you have understood by now, every decision process is a process about what will happen in the future. So one key thing in reinforcement learning is that you must predict, you must predict what will happen in the future given your past information, and you must control. Learning, predicting, and controlling are the three pillars of reinforcement learning. When you are in this upper corner, you don't need to learn, because you already have a model, and it's a good model. The only thing you have to do there is planning; it's a purely planning problem. Given the current situation, I can predict the future; if the system is stochastic, I can predict the probabilities of occurrence of future situations, and then I can decide what sequence of actions to take. Clearly, epistemologically speaking, this is the situation where you are in total control. If you move down here, you move to another realm, in which you have a good model but only partial observations. Still, the model is good; this is the situation where you plan under uncertainty. The model is known, but some parameters of the model are unknown to you. Just a question: how do you know that the model is good if you don't have good observations? Okay, I will tell you. For instance, let's go back to the coins problem. When I tell you that the coins are Bernoulli, I'm already giving you a model, but you don't know what the biases are. So your problem is to infer the parameters of the model. There are other versions of the bandit problem in which the outcomes are not just zeros and ones. Suppose that I roll a Dungeons & Dragons style die with an arbitrary number of faces, so your outcome could be anything from one to infinity, and you don't know the distribution of these outcomes. In that case, you don't have a model. So here, when you add this partial observability, the basic tool that we will use is to combine this decision process theory with inference. If we add inference to the game, we get to what we call partially observable Markov decision processes. In all of these situations, again, it's a problem of computation: you know your model, you know your state, or you know the probability of being in a certain state, and you can compute what will happen in the future and plan accordingly. But what happens if we move towards the left side of the diagram?
That is the interesting part, where we lack a model. All this knowledge that we had here, which we can describe more technically as epistemic knowledge about the system, and which is higher on this side, we have to replace with some other kind of knowledge, because we cannot control our system if we don't replace it with something else. And what do we replace it with? We replace it with empirical knowledge. Empirical knowledge is what you get from data. On the right-hand side of the diagram, you don't even need data to decide, because it's enough to predict: you are covered for all possible situations by these techniques. But when you lack epistemic knowledge, which is the usual situation, you have to integrate it with data. And when you move all the way to the left and you are totally lacking any model, this is the model-free part of the diagram, then you have to resort to something else, and this something else is typically trial and error. You do things, you interact with your environment. That's how you learn to ride a bike: you don't construct a model of what will happen in the future; you do things and you correct for errors. We will spend a lot of time in this upper-left part of the diagram, in which you are totally model-free but you have good, or perfect, observability. And then, of course, all the serious, important problems in machine intelligence lie here, at the intersection, where there is the full reinforcement learning problem, in which you have partial observability and you don't have a model. So this is essentially the outline of what we will be doing in the following. Of course, all of this will be put into mathematical language and algorithms. Does that answer the philosophical question? Yes. Okay, good. I think it's a good time, and place, to stop. If you have any other questions, burning ones, I can address them here; otherwise I'll stop the recording. I think there was the matter of the references and books that you might suggest. Okay, very good, so I will list them here. Sorry, can I ask a question about the diagram first? Sure. In our diagram, would the rest of machine learning sit in the top left part? There are different things in machine learning. For instance, Bayesian machine learning sits on the right, because you have a model: you have a model in terms of likelihoods for your events. Whereas, for instance, regression or classification are on the model-free side. Okay, so we put them at the top right? Well, observability is a less obvious concept in other machine learning approaches than in reinforcement learning. The horizontal axis is clearly something that applies throughout all kinds of knowledge; basically, what I'm describing on the horizontal axis is the old philosophical dispute between rationalism and empiricism, if you wish. But the vertical axis, the problem of observability, is intrinsically related to the dynamic nature of the problem: you need to know the degrees of freedom in order to predict, if you wish.
So I'm not sure I'm really able, right now, to classify other supervised or unsupervised problems of machine learning along the axis of observability; and again, I wouldn't oppose that to machine learning, because reinforcement learning is itself a part of machine learning. I have to think about it. Okay, thanks. So in this sense, reinforcement learning is a subset of machine learning? Yes, it is. It's a branch of machine learning which specifically addresses two questions that are not usually included in supervised and unsupervised learning: first, the dynamic character of the system or the data, so doing things, interacting with the environment; and second, controlling the system, that is, acting on your source of data. Okay, thanks. Maybe, if you want to know more about the differences between classical machine learning and reinforcement learning, I can give you a brief outline in the next lectures, if you ask me to. Okay, so, references. There's one book which is the go-to book for reinforcement learning, the book by Richard Sutton and Andrew Barto, called Reinforcement Learning: An Introduction, which, as you will see, follows a very different road from the one that I'm using for my classes, but every page of this book is worth reading. It's not very technical, so you will find very few proofs or mathematical statements, and those that are there are really at a very introductory level. We will do something a bit more substantial on the mathematical side; almost never at the level of full mathematical rigor, but slightly more advanced than Sutton and Barto's book. The latest edition is freely available: if you search for Richard Sutton's web page, I think it's called incompleteideas.something, probably .com, but I'm not sure, you will find the latest version of the book there. It's only missing the solutions of the exercises, so you don't really need to buy it; we have a copy in the library anyway. So this is the go-to reference. Then, for bandits, there is a book by Lattimore and Szepesvári, called simply Bandit Algorithms. This is an extremely good book, but it leans a lot on the math side: you will find the proofs and the algorithms spelled out. It's an excellent book, but don't be worried if you can't follow it as easily as the first one. That's the current state-of-the-art description of what happens with bandits, if you're very interested in the subject. And then, for the other subjects within this diagram, there are no real books; there are reviews and papers, and I will point to them along the way. But if you can read these, then you're in a very good position. There are also some other reviews I can share with you by email separately. Okay, if there are no further questions, I'll cut the recording off.