All right. So the plan for today is to put some flesh on the qualitative arguments that we put forward yesterday. Not very much, actually, but a little more, in order to make some progress towards the goal for this week, which is to set up the entire mathematical apparatus needed to discuss decision-making processes. So today we will have a mix of qualitative and more mathematical arguments, and again we will start quite gently.

In response to a question that was raised yesterday, which I think is worth discussing in greater detail, I would like to spend a few minutes comparing the different paradigms of machine learning, to see in what exactly reinforcement learning differs. Always remember that eventually, when you have to face a challenging problem like learning to play the game of Go, or learning to control agents in very complex environments, you will often use a mix of the different paradigms: you can leverage supervised and unsupervised learning to strengthen your reinforcement learning approach, and for some purposes it is actually necessary to resort to ideas from the other paradigms. So don't think of these as compartmentalized objects; still, I will now highlight the differences.

So, first, a very superficial and short recall of what the main paradigms are. Again, the boundaries are blurred, so I am oversimplifying here. In a nutshell, when you face a so-called unsupervised learning problem, the kind of question you are asking can be formalized as follows. You have some data, and each item in this data list is a vector belonging to some possibly high-dimensional space R^D. In very simple terms, unsupervised learning asks whether I can find a representation of this data in terms of a probability distribution, conveniently parameterized by a parameter living in some space R^M, and ideally I would like M to be much smaller than the dimensionality of the data itself. Said in plain words, we would like to unveil the structure of our data: we would like to compress it, to organize it. The goal of finding an effective representation of the data is essentially this. There are different techniques, clustering and dimensionality reduction among them, but they all obey the same principle: find some compact representation of your set of data.
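To fix notation (this is just my compact summary of what I said, nothing more): given data points x_1, ..., x_N, each in R^D, unsupervised learning looks for

\[
\{x_i\}_{i=1}^{N} \subset \mathbb{R}^{D} \;\longrightarrow\; p_\theta(x), \qquad \theta \in \mathbb{R}^{M}, \qquad M \ll D,
\]

that is, a parametric family of distributions whose parameter lives in a much smaller space than the data; learning means choosing theta so that p_theta describes the data well.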
Supervised learning is actually not entirely disconnected from the task of unsupervised learning. The idea is that now your data come in couples, in pairs, each pair belonging to some space R^D x R^{D'}. At a very general level, the unsupervised objective for such data would be similar: you would search for some distribution of your joint data, which as such has no particular interpretation. But if you start thinking of this data as inputs and outputs, then you can see that you can settle for a less ambitious goal, which is nonetheless very interesting: for instance, you can settle for finding a description of how the y's are distributed given the x's. Here you are putting some structure on your pairs of data, in the sense that you are interpreting them as independent and dependent variables, if you wish. And if you scale down your objective even more, and rather than asking for the full conditional distribution you ask, for instance, for the expectation of y given x, according to some unknown parameterized distribution, well, this is essentially what supervised learning does in practice: you look for a function of some set of parameters, f_theta(x), which is able to predict to some extent the expected value of y given x. Of course, you can only do that if, as always, there is sufficient structure in your data. This is the typical task of supervised learning.

Graphically speaking, this amounts to the following: you have some data. Here I am plotting as a line what is in fact a possibly high-dimensional space, in both x and y, and I have some scattered data. My goal is not to understand what the full distribution is; I am more than happy if I can find a function that effectively identifies the average of the y's for a given x. That is what my function f_theta(x) would look like, and this is what is called the regression task. Not very different is the classification task, except that now you restrict your function to take a finite set of discrete values, the labels of your classification problem. So now the situation is that you have data like this; suppose there are just two labels, so you have pairs made of an input x, in this case a real variable, and a label, which can be zero or one. And again the task is to find the best description in terms of the probability of belonging to one label or the other. This is what you would call your classification task; but essentially the two are the same. So this is a very superficial description of what, at the very basic level, supervised and unsupervised learning do; a minimal sketch of the regression version follows below.
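To make the regression task concrete, here is a minimal sketch in Python, assuming synthetic one-dimensional data; the sine curve, the polynomial model, and all names are illustrative choices of mine, not anything prescribed in the lecture.

```python
import numpy as np

# Synthetic one-dimensional data: y = sin(x) + noise (purely illustrative)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 6.0, size=200)
y = np.sin(x) + 0.3 * rng.normal(size=200)

# f_theta(x): a degree-5 polynomial; theta is fit by least squares,
# so f_theta approximates the conditional mean E[y | x]
degree = 5
X = np.vander(x, degree + 1)                 # design matrix [x^5, ..., x, 1]
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predict the expected y at a few new inputs
x_new = np.linspace(0.0, 6.0, 5)
print(np.vander(x_new, degree + 1) @ theta)
```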
Now, there are two things conspicuously missing from this approach to data. The first is that, as you can see, in all of this, in the way the algorithms deal with the data, there is very little interaction with the data-producing mechanism itself. You get your bunch of data, then maybe you manipulate it a little, you can separate it into a training set and a test set, but there is no real continuous interaction with the data. There is no built-in notion of a closed loop, of continuous, online interaction; you can use online algorithms, but they are not part of the problem itself. You don't need to do that, although you can. The second missing thing is that there is clearly no notion of dynamics, in the sense that the way the data are indexed need not correspond to time as we understand it; there is no causal structure in the data, nothing saying that what comes first determines what comes next. This is not necessarily in the game.

Of course, you can use these techniques and tweak them to describe dynamical systems, which goes a little in the direction of reinforcement learning. So: unsupervised and supervised learning for dynamics. The basic idea is that now you explicitly have something which is time, and a data point x_i is actually a sequence: x_i at time 1, and so on up to x_i at some final time T. If you have many such dynamical sequences, you have many trajectories in your high-dimensional space: one single data point is a whole trajectory, with time going from 1 to T. Suppose you have many of these trajectories and you want to understand whether you can describe this process in time. For instance, one thing you could do is look for a parametrization that gives you the probability of being at position x_{t+1} given that you were at position x_t. In this case you are seeking a Markov model of what is happening: you are trying to describe your process as proceeding in steps, where the next step does not depend on the full history but only on the previous step. And then you can apply either unsupervised or supervised learning, depending on the situation you are in, to find the best description in terms of parameters. This goes a little in the direction of reinforcement learning, in the sense that we are adding a dynamical component, a prediction over time. But there is still very little interaction with the data itself, in the sense I told you: you get a bunch of trajectories, generated by an oracle or observed experimentally, and then by machine learning approaches you try to derive a model, to construct a model, or to select among possible models.
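As a minimal sketch of this last setting (my own illustration, assuming a small discrete state space and trajectories simply handed to us by an oracle), the Markov transition probabilities can be estimated by counting:

```python
import numpy as np

# Trajectories over a small discrete state space, as if produced by an
# oracle or observed experimentally (all numbers here are made up)
n_states = 4
rng = np.random.default_rng(1)
trajectories = [rng.integers(0, n_states, size=50) for _ in range(100)]

# Estimate the Markov model p(x_{t+1} | x_t) by counting transitions
counts = np.zeros((n_states, n_states))
for traj in trajectories:
    for s, s_next in zip(traj[:-1], traj[1:]):
        counts[s, s_next] += 1
P = counts / counts.sum(axis=1, keepdims=True)   # normalize each row
print(P)   # row s is the estimated distribution of the next state given s
```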
One thing that goes a bit more in the direction of reinforcement learning, but still isn't it, is what is known as active learning. A good example of what active learning does is a classification task, so let's go back to the situation we were just discussing, but, just for graphical purposes, let's take input data with two components; the two axes are now the two components of the input (I should probably use another notation, but you will allow me). So I have many data points here, the red ones and the blue ones, and the goal of the classification task, of course, is to find out whether I can trace some boundary here, or use some smoother description in terms of the probability of belonging to one group or the other. In the classical supervised learning approach, you have your cloud of data and you pick whatever technique suits the structure of the boundary: if it is linear, you can use perceptrons or support vector machines; if it is nonlinear, you may want some kind of kernel trick to handle the nonlinearity of the boundary. There are lots of techniques. But active learning has an advantage with respect to these approaches: in active learning, you can ask for data. For instance, suppose I am in the current situation and I want to improve my description of the boundary. A smart thing to do is to query my oracle, that is, my environment, my data-producing mechanism, and say: I have a tentative description of the boundary, this green line, and I want to improve it; why don't you give me data here, where I am most uncertain? Those data points, if they are crucially positioned there, will allow me to sharpen the distinction between red and blue; I don't need any more data far from the boundary. (A small sketch of this idea follows at the end of this part.)

So there is this different notion of being able to interact with the data-producing mechanism, which of course can speed up the classification process enormously. In fact, one can prove that in this kind of classification task active learning can increase the speed of learning exponentially, because you are asking the right questions at the right point. This certainly goes more in the direction of reinforcement learning: when you learn by interacting with an environment, and here we can go back to the example systems we discussed before, you can push your system to go in the directions where the relevant information is. Think about the Roomba robot, which has sensors in front of it: if the robot just by chance ends up facing a corner of the room and doesn't see anything, then a relevant action is simply to turn and look around, because this gives more information about the environment than staring at the corner. This kind of information-seeking action is very important in reinforcement learning.

So, in a nutshell, when we think about reinforcement learning we are basically combining these two ideas. First, we want to construct a description of the world that allows us to predict what will happen in the future; this prediction can be done using a model, but also without one, basing yourself just on previous observations to infer what will happen, without a generative model of it. Second, once you are able to predict what will happen, you want to control it. Prediction and control are the two things that are not embedded in the other machine learning paradigms. Like I said, we will nevertheless discuss situations in which we can leverage other ideas from machine learning, especially artificial neural networks, not for the purpose of learning in a supervised or unsupervised way, but rather thinking of them as powerful function approximators. This will become clearer in the following, but I hope this short discussion helped clarify what is similar and what is different with respect to the other machine learning frameworks. Any questions on this?
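Here is the small sketch of uncertainty sampling promised above. It is my own illustration, assuming a logistic-regression learner and a hidden linear boundary standing in for the oracle; none of this comes from the lecture itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
pool = rng.uniform(-1.0, 1.0, size=(500, 2))    # pool of unlabeled inputs

def oracle(X):
    # Hidden ground truth, playing the role of the data-producing mechanism
    return (X[:, 0] + X[:, 1] > 0).astype(int)

# Seed the labeled set with two extreme points, so both classes are present
s = pool[:, 0] + pool[:, 1]
labeled = [int(np.argmin(s)), int(np.argmax(s))]

clf = LogisticRegression()
for _ in range(20):                              # query budget
    clf.fit(pool[labeled], oracle(pool[labeled]))
    p = clf.predict_proba(pool)[:, 1]
    uncertainty = -np.abs(p - 0.5)               # largest near the current boundary
    uncertainty[labeled] = -np.inf               # never query the same point twice
    labeled.append(int(np.argmax(uncertainty)))  # ask the oracle where we are unsure

print(f"queried {len(labeled)} labels out of {len(pool)} points")
```

Compared with labeling points uniformly at random, the queries concentrate along the current estimate of the boundary, which is exactly where each new label is most informative.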
So now we revert to our initial agenda. The first key ingredient that we introduce today, still not a very quantitative description of the process of decision making, but nevertheless very important in what follows, is what is called the agent-environment interface. Here we start introducing a little of the lexicon needed to set up the decision-making problem. The agent-environment interface is an abstract description of an adaptive autonomous system, which aims at encompassing basically everything. Everything means every problem we discussed yesterday: bandits, robots, engineering control, you name it; any decision-making process should fall into this description. Of course, if you aim for such a broad description, it will necessarily be very generic, but also very flexible.

So what is the idea? We want to identify some basic ingredients of this decision-making process. If you open a book on artificial intelligence, this is basically on page one or two. It is an abstraction by which we identify one entity, the agent, and agent, as you probably know from the lexicon, means the one who does, the one who acts. This agent interacts with the environment. The environment is what surrounds the agent, but formally we typically depict it with another box, which is probably not the best graphical solution, but that is how it is usually drawn, and I don't want to confuse you with other graphical choices: outside the agent there is the environment. Now, where the boundary between agent and environment lies is a delicate issue, something that, case by case, as you will see, can be hard to define. Think about human behavior, for instance: what exactly is external to us, and what is internal? Where the two part is not obvious in itself. But for us this division is a very useful assumption, one that greatly simplifies the mathematical description. Whenever you meet a new system, you should always ask yourself: which part makes the decisions, and where is the boundary with what is not under the agent's control?

So, agent and environment: how do they talk to each other? There are two interfaces between them. The first, which I will depict here in blue, is the sensory interface. Here we are borrowing terminology from engineering, but for living beings the sensors could be receptors on the skin or in the organs; they could be any channel through which signals come in through the senses. This sensory interface mediates signals that come from the environment and are transformed into something that happens inside the agent. The signals that go from the environment to the agent are called, in general, percepts. These percepts will eventually become some high-dimensional vector, which contains a lot of information; in reinforcement learning it has a very specific structure, which we will discuss in due time, shortly I hope. For the moment, think of it as containing all the information collected about the environment. For a human, that is visual information, olfactory information, tactile information, you name it: anything that comes in the form of a stimulus is encoded in this very high-dimensional percept. It can be a faithful representation of the environment, or just a very partial description of it. In the human case, for instance, even combining all your visual, auditory, and tactile inputs does not give a very accurate description of the full environment around you: there is noise, and vision is always limited in extent and in angle of view. So the percept is often a very partial view of the environment. Nevertheless, it is sometimes useful to abstract this away and say that our percepts are a very clear snapshot of what the environment really is. Of course, in doing so we are assuming that there are a few relevant degrees of freedom that matter and that the rest can be neglected. Sometimes this assumption is correct, sometimes it isn't; often it isn't, in sufficiently complex systems. But for now, the percepts are whatever comes in to the agent.

The second side of the interface goes from the agent to the environment, in the other direction if you wish. This is what goes through the actuators. The actuators are, in a robot for instance, all the engines and bolts that make the wheels turn, that make the robot move in one direction or another: anything that produces some displacement of the state of the agent with respect to the environment. So the actuators modify the relative position of agent and environment, and what the agent does through them is an action. This is the basic lexicon that we will use repeatedly in the following.
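Schematically, the interface is just a closed loop of percepts and actions. Below is a minimal sketch of that loop in Python; every class and method name is an illustrative choice of mine, not a reference to any particular library.

```python
import random

class Environment:
    """Toy environment: the agent sits at an integer position, and the
    percept is a noisy reading of that position (partial information)."""
    def __init__(self):
        self.position = 0

    def percept(self):
        return self.position + random.choice([-1, 0, 1])   # imperfect sensor

    def step(self, action):
        self.position += action   # actuators displace the agent in the environment

class Agent:
    """Toy agent: tries to reach position 10 using only its percepts."""
    def act(self, percept):
        return 1 if percept < 10 else 0

env, agent = Environment(), Agent()
for t in range(20):          # the closed perception-action loop
    p = env.percept()        # environment -> agent, through the sensors
    a = agent.act(p)         # decision
    env.step(a)              # agent -> environment, through the actuators
print(env.position)
```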
At this stage, it is useful to go quickly back to the three examples we discussed yesterday and review very briefly what the notions of agent, environment, and percepts are in the three different cases.

Let's start with the example of multi-armed bandits, and let's think about a real situation: you really are inside a casino, and in the casino there are many slot machines scattered around. Just for simplicity, let's say they come in two different kinds. So this is my depiction of a slot machine, and this is the arm. You have big ones and small ones; by this I just mean that they come in two different brands, and you may have several of each. And let's assume, or say you know because you have been informed separately, that the behavior of the small ones is different from the behavior of the large ones. You still don't know their probabilities of winning, which is what you want to discover, but you know that the two kinds are different. For instance, suppose that each machine has some distribution over the amount of money you win, an amount between zero and one, every time you pull the lever. You don't know what these distributions are, but I am telling you that, typically, large machines have two bumps in their distribution and small machines have just one bump. This is just to say that there really are two different classes of distributions, by which you can distinguish the machines. Beyond that, you know nothing: all the distributions themselves are unknown. You have to discover, while playing, which machine is best in terms of the average amount of money you get out of it. So the problem is always the same; the reason I am introducing this difference between small and large is that it is a sort of side information for when you develop a strategy. Suppose the rule is: this turn you play on the large machines, and you can choose which large machine to play; next turn I decide again, and now you play on a large machine again, or no, now you play on a small machine.
Then clearly, in such a situation, you may want to have two different strategies, one for the large machines and one for the small machines. This is what is called contextual information: there is something extra in the percepts. So in this case, let's go through the ingredients. The agent is the person who pulls the lever. The environment is the set of slot machines. The actions are pulling the lever of one machine, one and only one at a time. And what are the percepts? Well, here there are two. One is the money I win at each turn; the other is which class of machines I may use. This example is useful because it already shows a structure that will always be present in our problems. The first part of the percept, what I win, is connected to the goal: it tells me what I am gaining. The second part is not directly connected to the gain, but it gives me information about the context, and I want to combine the two. This is a general feature of reinforcement learning, and we will see more of it in the following; a small simulation sketch of this contextual bandit follows below.
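Here is the promised simulation sketch. The reward distributions are made up by me, respecting only the story above: two bumps for the large machines, one bump for the small ones; the epsilon-greedy rule is just one simple strategy among many.

```python
import random

# Hidden reward distributions (unknown to the agent), with values in [0, 1]
def pull(machine):
    cls, mean = machine
    if cls == "large":   # bimodal: two bumps around the mean
        center = mean - 0.3 if random.random() < 0.5 else mean + 0.3
    else:                # small machines: a single bump
        center = mean
    return min(max(random.gauss(center, 0.05), 0.0), 1.0)

machines = {"large": [("large", 0.4), ("large", 0.6)],
            "small": [("small", 0.3), ("small", 0.7)]}

# One strategy per context: epsilon-greedy with per-machine running averages
stats = {cls: [[0.0, 0] for _ in machines[cls]] for cls in machines}

for t in range(5000):
    cls = random.choice(["large", "small"])   # context imposed at each turn
    s = stats[cls]
    if random.random() < 0.1:                 # explore
        i = random.randrange(len(s))
    else:                                     # exploit the best estimate so far
        i = max(range(len(s)),
                key=lambda j: s[j][0] / s[j][1] if s[j][1] else 1.0)
    r = pull(machines[cls][i])
    s[i][0] += r
    s[i][1] += 1

for cls in stats:   # estimated mean payout of each machine, per context
    print(cls, [round(tot / n, 2) if n else None for tot, n in stats[cls]])
```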
As a second example, let's consider the navigation problem, that is, the cleaning robot. It is useful at this stage to introduce a slightly more formalized version of this navigation problem, which is called GridWorld. It is just a simplified version of the problem of moving your robot around: the problem is always the same, but now it takes place on a tiled domain. Rather than having a continuous set of positions, velocities, et cetera, we just have a discrete arrangement. At each time step our robot is sitting on some tile, this is the position of the robot, and what the robot can do is hop onto a neighboring tile. Around this domain there are some tiles where you get some prize; this would be the dust to be collected by the robot. And somewhere around there is also your charging station, maybe more than one. You can put in all the details you want. The idea is that the robot has to move around, clean the dust in the corners, and then go back to the charging station; you can define the goal in whatever way you like.

Clearly, in this situation it is not difficult to see that the robot is the agent. The grid is the environment, together with the prizes and the charging stations: where I can move, and where the relevant features are located, all of this makes up the environment. The actions are the moves in one direction or another. And the percepts? Well, that depends on what the agent can see. For instance, we might say that the agent has some localization device that tells it where it is with a certain precision, a sort of localization range: suppose it has a GPS, the signal goes out and comes back, and it says you are in this location with a precision of a meter, half a meter, whatever. That is one possible percept, and there may be others, like the presence of obstacles detected by infrared sensors, you name it. I am not going to be too detailed on this, because it is not important; there might also be some angle of view by which the robot sees things around it. These are all the percepts. So here we have been placing all these terms inside the agent-environment interface. Is everything clear up to now? Good.

The next step will be to formalize all these notions, these names and concepts, into a mathematical framework, and this mathematical framework will be at the base of everything we discuss afterwards. Every time we think about a task in reinforcement learning, we will have to keep in mind that in the background there is some decision process, in the form of a Markov decision process, which we will discuss shortly, and we will have to identify the various actors in that process. But before doing that, let me take five more minutes of your time to recapitulate the discussion we had last time about the amount of knowledge we have about the system.

If you remember, yesterday we discussed a possible way of arranging decision-making and reinforcement learning problems depending on the knowledge of the model and on the observability. So it is useful to look once more at our two problems, the multi-armed bandit problem and the navigation problem, and see what it means to sit at different positions in this diagram. To focus on one case for simplicity, let's consider the navigation problem. What does it mean to be in the corner where you perfectly know the model and have perfect observability? First, perfect knowledge of the model means that you have an accurate map: the agent has in its memory exactly the map you are seeing now. It knows exactly where all the features of the environment are located: where the dust is, where the charging station is, how the room is shaped, where the corners and the obstacles are. Everything is known from the beginning. You can already see some problematic aspects in what I am saying, but that is the setting: being on the rightmost part of this graph means, for a navigation problem, having an accurate map of what is around you. And second, being at the top means that your GPS, this turquoise circle I drew, which recalls the infamous blue dot of Google Maps, is very well localized: you know exactly which tile you are on. So this corner means accurate map and accurate location. Is there any question? Okay, I heard something in the background.

What happens if you move down? Well, you still have an accurate map, but you can no longer locate yourself clearly: poor location (I don't want to write GPS here, because that is a bit too specific). It is just like when you open Google Maps and you have a very accurate map of everything around you, you can look up whatever detail you like, but there is a very large blue circle telling you that you could be anywhere within an approximation of 100 meters, which might not be good enough if you have to reach one specific location.
And what does it mean to be up here on the other side? There your location is good, accurate location, but you have a very poor map, or no map at all, which is a situation that starts making more sense in practice. Suppose you are in the woods: your location is somehow perfect, your GPS locates you within five meters, but the map has no features there. So there is no way to decide in which direction you should move, because you have no local features to refer to; you will need something else to orient yourself. This too is a situation very often present in practice. And then, when you go to the remaining corner, which is where the full reinforcement learning problem lives, you are in the usual situation, if you wish, in which you have poor or no map and poor or no location capabilities.

The idea, in the following, is to work with generic problems, but you can always go back to this navigation problem to fix the ideas if you want. We will start by describing what you can do when you have an accurate map and an accurate location. What you do, basically, is ask your algorithm to compute the best route to the target. This is what happens in practice when I say: I am at a certain location in Miramare, Trieste, and I want to go to Montezón Colón; what route should I take? I have my location, I have the map, I have all sorts of information; it is a computation, it is a planning problem. So we start from these problems first. Then we move to problems in which we can still do planning, but my location is somewhere between Monfalcone and Muggia. Can I still plan? Maybe I have to consider a distribution of possible routes, depending on where I start; so I can still do planning, by upgrading my problem from the level of deterministic paths, if you wish, to more general probabilistic inference. And if you go to the upper-left corner, where you know where you are but you don't have much information about the environment, then you will have to resort to different techniques, including building empirical knowledge into the process. We will go through all these steps, trying to put the pieces together, and eventually trying to approach the most interesting corner by taking along the various lessons learned on the way. Good.
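Before the break, a minimal sketch of this first corner, accurate map plus accurate location. The grid layout below is made up, and breadth-first search is just one simple way to compute the best route on the GridWorld from before.

```python
from collections import deque

# A GridWorld map, fully known to the agent: '#' wall, 'C' charging station
grid = ["....#....",
        ".##.#.##.",
        ".#.....#.",
        ".#.###.#.",
        "....#..C."]

def plan(grid, start):
    """Shortest route from start to the station 'C' by breadth-first search.
    This works only because the map and our own location are known exactly."""
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if grid[r][c] == "C":               # target reached: rebuild the route
            path, node = [], (r, c)
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):  # hop to a neighbor tile
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != "#" and (nr, nc) not in parent):
                parent[(nr, nc)] = (r, c)
                queue.append((nr, nc))
    return None                             # no route exists

print(plan(grid, (0, 0)))
```

If the location is uncertain (the big blue circle), the same idea has to be upgraded: one plans over a distribution of possible starting tiles rather than a single known one, which is the probabilistic version mentioned above.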
So, a good time to take a break.

Student: Can I ask a small question? There is one thing I haven't understood very clearly: the knowledge of the model. Does it map to an intrinsic quality of the environment, or is it a quality of the perception apparatus of our robot? Because when you were talking about the lower-left part of the diagram, you said that maybe that part of the map has few features and cannot be recognized. So is it about the environment or about the agent?

Professor: That's a very good question. It's about both. If you are on the right-hand side of the diagram, it means you have a good model of where you are and of what you do. That is, if I take an action, I also know what the consequences of my action will be. For instance, if you are a robot and you say, I want to go east, but the robot skids and drifts a little to the side, you know that this happens with a certain probability; you know that already. So you know the consequences of your actions, and you also know the properties of your observation system: you know that if you are in a certain state, you will make a certain observation with a certain probability. All of this is part of your model; you have a model describing all your interactions with the environment. So it is a property that combines the environment and the way you interact with it. Having a model is a very powerful assumption, and it sits right on the fuzzy line that separates the two entities, basically.

Student: Another thing. While we are talking about this particular problem, we saw that we can be at any point inside this diagram. Should we think of solving a problem as a path through the diagram, or should we think of a single problem instance as being located somewhere in the diagram and staying put?

Professor: The formalism we will describe includes all sorts of situations. As you will see, in its formal description it is in general something that happens on a graph of abstract states, and you move around in this graph of abstract states by taking actions that can lead you closer or farther. So, in a sense, the framework is going to be extremely flexible in this respect. I chose this particular example because it gives some intuition about the kinds of options we are thinking of. Did that answer the question?

Student: Yes, I think so; maybe I was distracted for a moment. The point about this abstract space: does it take into consideration the state of both the agent and the environment?

Professor: Yes. When we talk about the state, and we will do that shortly, you always have to think of it as a property of the agent and the environment jointly. If you want, it is the relative state of the agent with respect to the environment. Just to give you an idea, a very simple example. I am sitting, I am the agent, in my room, and I can do this action, which is to push my seat back. You can legitimately say that all the rest of the environment hasn't changed, but my state with respect to the environment has changed; in this sense we will say that there has been a change of state. Then there are other ways of operating changes: for instance, I take this pen and displace it one meter to the side. In this case you would say, but now you have changed the environment. And I am arguing, and this is actually the common understanding, that both are changes of state, in the sense that what matters is the relative configuration of agent and environment. So we talk about the state as a general object, more than about the single entities.

Student: Okay, that makes perfect sense. It is always the fuzzy-distinction problem; one has to be careful when we say it is a state.

Professor: One has to be careful, yes, I agree.

Student: Thanks.

Professor: Sure. Any other questions? Okay, if not, we take our usual break and reconvene at ten past ten. Thank you. See you later.

Student: Professor?

Professor: Just a second.