Welcome back, everybody. It's my pleasure to chair this session this afternoon. The lecture will be given by Florian Marquardt from the Max Planck Institute for the Science of Light in Erlangen, and he will be talking about reinforcement learning. So please, we are looking forward to it.

Thank you, Marcus, for the kind introduction. First I want to get an impression of who you are. How many of you have already trained a neural network? Most of you. How many of you have some experience in reinforcement learning? Okay, there's a bunch; maybe you keep quiet when I ask questions to the audience. This lecture is supposed to be relatively interactive, so from time to time there will be questions, and you will be asked to discuss with your neighbor and possibly come up with an answer.

But let's get started. This is a famous example that I want to briefly mention. What you're looking at is of course the board game of Go, one of the more complex board games, even though the rules are extremely simple: you can place these white and black stones at any point on the board, and you try to encircle the opponent's areas; the details do not matter. All of you probably know that in chess, computers became better than the best human players already in the 90s. In Go this was not yet the case, and the reason is that in Go you have exponentially many more possibilities of placing the stones. So it was believed, until a few years ago actually, that it would still take quite some time until computers became better than humans at this complex board game. Nevertheless, back in 2016 the company DeepMind, which most of you are aware of, released a computer program based on neural networks, and in particular on the general set of techniques called reinforcement learning, and this computer program was able to beat the best human players. In fact, this success was so sensational that one of the world's best human players made a remarkable comment. I let you read it, but basically he said he would go as far as to say that not a single human has touched the edge of the truth of the game of Go, and that only by seeing the computer program play has humanity, which has played this game for thousands of years, finally understood that there are better strategies out there than any that were ever invented by humans. These are obviously strong words, and we will learn how this set of techniques works and how it can be applied.

Now, since you already had an introductory lecture on neural networks: most of what you have seen goes under the heading of supervised learning. Supervised learning you can think of as having a very smart teacher. This teacher knows the answers to all possible questions, and he gives a sampling of these question-answer pairs to a student. He says: if you are asked "in which country does Trieste lie", you have to answer "Italy"; if you are asked this other question, you have to answer that; and so on. Eventually the student will be able to imitate the teacher, and maybe even extrapolate slightly from the answers given by the teacher, so that the student can also answer some questions that were not in the training set, but those cannot be very far removed. So very obviously, the final level here is limited by the level of knowledge of the teacher: the student cannot become much better than the teacher in this setting.
However, if you are an ambitious student, or let's say a scientist, you certainly want to become better than your teacher eventually; you want to become better than I am in reinforcement learning. So how do you do that? The only general technique we know is basically trial and error. If no one tells you the right answer to a problem, you try out one thing, you try out another thing and fail again, you try a third thing, and maybe in this third attempt you get something a little bit right. Then you keep this strategy for solving the problem and start to modify it a little, to see whether you can do even a little bit better. So it is trial and error, combined with the idea that once you have encountered a reasonably good strategy, you modify it and you reinforce the good actions that you have taken. That's the origin of the term reinforcement learning: you reinforce good strategies and make them even better. And here, hopefully, the final level is unlimited, because there is no teacher, no one telling you what to do step by step; you are finding it out on your own.

So at its heart, reinforcement learning is really about discovering strategies: strategies to solve problems. For example, a self-driving car or a robot has to solve the problem of getting from here to there by the shortest possible route, and there is no teacher already telling it which steps to take; it has to figure that out on its own. It would observe the immediate environment, maybe because the robot has a camera built in, and then it has to decide whether to move left, right, or forward, but again no one tells it the right solution. Famously, as I just mentioned, you can apply this strategy discovery to the domain of games: in a game you observe the board, say the Go board, and you have to decide where to place the next stone. Again there is no one teaching you what to do, but you certainly know when you have done well at the end of the game, because that's when you win. And in physics, we can measure a quantum system, observe things, and depending on the observations we apply certain controls: maybe we send microwave fields into the quantum system, or we control a complicated plasma fusion reactor by changing the electrical currents in the coils that control the plasma. This is of the same kind, and again no one tells you the right strategy. You don't want to just imitate a known strategy; you want to discover the strategy that is best, say, to preserve the quantum information for the longest possible time in your quantum computer, or to stabilize the plasma for the longest possible time. This is what reinforcement learning is all about. Here is a picture of how you would imagine this in the quantum realm: you have what we will later call an agent, here depicted as a neural network, trying to steer the quantum computer to do what you want it to do.

So let me start by telling you about the basic setting, how we formalize this set of challenges when we think about reinforcement learning. You see here a little robot roaming around the world and maybe interacting with the objects in the world. From the point of view of a computer scientist, you abstract everything away and say: the little robot I call my agent. It's an agent because it can decide to do things on its own; it is not purely passive, it can really act.
The rest of the world I call my environment, in the sense of reinforcement learning. Whenever I talk about quantum physics this causes some confusion, because in quantum physics, as many of you know, you talk about an environment when you think about dissipative quantum systems: you have a little qubit interacting with the environment of all the rest of the world, and that leads to dissipation. That's not the meaning here. Here the environment is simply everything besides the agent, and the agent is trying to do something with the environment.

The feedback loop we are thinking about in reinforcement learning goes like this. The agent observes the environment; maybe it has a camera, or maybe it only gets a few sensor inputs. Based on this observation it has to decide, as I said before, on the next action: what should I do? In different scenarios there will be different actions available. In this scenario with blocks that I can stack, maybe I can move a robot arm, or decide which block to stack upon which other block. These are all actions that will eventually also change the state of the environment. So I take in the observation and try to map it to the next suggested action; then I take this action, the environment changes, I take in the next observation, and so on. This mapping from the observed state of the environment to the next action is called a policy. You could also call it a strategy, but in this field it's called a policy: a mapping from state to action. And then you go on and on. The thing you need in the end is to define what counts as good, and that is where the so-called reward comes in.

But let me make the setting really clear. Again, I have a robot moving around a two-dimensional world, and maybe it wants to pick up these boxes. The state that the agent takes in might just be the position of the robot, measured in its coordinates; that's one possible choice of state. It could also be the full image that we are seeing; that's another choice. Depending on which choice of state you make, which kinds of observations you have, you can do more or less. For example, if these boxes are always at exactly the same three positions whenever I start this game, then it is possibly sufficient to give as the state the current location of the robot, and during training, which we will discuss, it will eventually figure out a good path between the boxes. However, if the boxes can be at arbitrary locations, at a different location every time I restart the game, then it is not sufficient for the robot to know only its own position: it also needs to know where the boxes are, in order to plan ahead and figure out the best path towards them. In that case you would want the observed state to be the full image, for example, or maybe the distances to the boxes, if there is some kind of radar built in. So depending on what the state is, you can solve more or less complicated problems. That's one of the first things you have to decide when you do reinforcement learning: what is the observed state that my agent is supplied with? Maybe there are physical constraints, because you have built this robot and it doesn't have a camera, only a kind of echo sensor or something.
And maybe there are also constraints because, once you get this state and feed it into your agent, the agent has to process it somehow; if the agent is a neural network and it is overwhelmed by the state information, that's also not good. Okay, so that's the state. The actions in this example are pretty clear: the robot can move in the four different directions; those are the actions it can decide on.

But in order to teach it what you want it to do (as I said, this is not supervised learning), you will not tell it in each individual step, "now it would be good to move up, now it would be good to move to the right". That would be supervised learning, and no reinforcement learning would be necessary; you could do this if you knew the strategy, but if you don't, it's not a good start. So in order to formalize what you want, you give it a so-called reward. You tell it: whenever you pick up a box, you get a reward of plus one, and if you pick up all three, you get plus three at the end of the game. So there is at least some indication of what you want, without revealing any strategy, because you haven't figured out the strategy yourself; you just state what the desired end state is. Is the setting of reinforcement learning clear so far? I see nodding. So at least we know what we want; the question is how we get there.

The first thing I already mentioned: the correct action is not known. This is not supervised learning; we have to do something else. And I already mentioned the idea: we give an indication of what we want in the end by giving a reward. For example, in the board game of Go you get plus one if you win the game and minus one if you lose, at the very end; or in this game of picking up boxes, the reward depends on the number of boxes you picked up, say, in a given time. The question I now want to pose to you is: how could we optimize the reward, if we don't know anything else? Those of you who already know reinforcement learning should not participate now, but the rest: get together with your neighbor and discuss, maybe based on things you have learned earlier in this series of lectures, how one could go about optimizing such a reward. You know there is a reward that is defined, you know you can take actions; how would you go about it? Maybe discuss for five minutes; I'm sure you will come up with at least some naive solutions. Really, get together and discuss, even if you don't arrive at a solution. And think about it concretely: there is this little robot, at each time step it can move in four directions, and it wants to find just one box; that's already good enough. It will get a reward, and maybe it moves for just a few time steps. If you don't know anything else and you want to brute-force this problem, what could you do?

Okay, maybe one more minute, and then we wrap it up and I will ask for suggestions. Okay, so let's discuss; I saw there were vigorous discussions among you. Who wants to chime in with a suggestion for what to do if you don't know anything about this reinforcement learning business? What's the simplest thing we could do? Anyone want to volunteer? Yes? Interesting. Okay, so I think you're already one step ahead. I think you were thinking initially
about just doing random walks and seeing whether you hit something, right? That's the starting point, the most naive strategy. And if you do hit something, maybe you store this particular random walk, and then at least you know: this sequence of steps will take me to the target. But of course it's super unlikely, if the trajectories are long and the targets are rare, that you hit the right thing. And you are already trying to improve on this: you are saying, I don't want to always return to the same old places I have already visited, I want to avoid them, so I spread out quicker. That's definitely a better exploration strategy. Anyone else want to volunteer anything? In the lectures you had this week, when you wanted to optimize something (machine learning is a lot about optimization, right?), what did you typically do? What's the go-to technique when you want to optimize a cost function? Yes: gradient descent. Now here, of course, it's a little bit tricky, because the actions are discrete: how can I even take gradients? That will be one of our topics.

So we have already identified two problems. First, brute force is super difficult for such problems: in each time step I have multiple actions, so with a hundred time steps I have something like four actions to the power of 100 possible choices; it's completely hopeless. And it gets even more hopeless if the state is not just the location but a full image, because then for each possible image you would have to tell me the action, and that for each possible time step. So there is an exponential explosion. Second, we just discovered that gradient descent, which we really like, is not immediately applicable, because these are typically discrete actions. We have to deal with both of these problems.

Fortunately, we do not need to invent the solution at this point, because it was invented already 30 years ago. There are actually two big approaches, policy gradient and Q-learning, that I will discuss in this lecture and part of the next, and more recently they have been merged to give even more powerful approaches. But let's start with policy gradient. As the name implies, we will do some gradient descent, but we have to be smart about it.

So, policy gradient, also sometimes called REINFORCE, was invented in 1992 and is, together with Q-learning, one of the two simplest general reinforcement learning techniques. I already mentioned something else here: model-free. I will come back to that; it means the technique is not supposed to know anything about how the world behaves in response to your actions, so it treats the world a little bit like a black box. And I will already reveal the key idea: you turn these discrete actions into something continuous by moving to probabilities. Instead of deterministically announcing a particular discrete action upon observing a state, I only announce probabilities for taking the different actions: the probability to move up, the probability to move left, and so on. Probabilities are continuous numbers, so if you parametrize them, you will end
up being able to do gradient descent. That's one big trick in this business: turn discrete into continuous, in this case by taking probabilities.

Let me first formulate in words what the strategy will be, and then we will derive it mathematically. You take these probabilistic actions: you move up, you move left; it's like a random walk, but maybe with biased probabilities, so at this location you are more likely to move up than down, and so on. You go through the game and you see what happens in the end, whether your reward was high or not. So it's a little bit like the random walk, but already a little smarter. And if the reward turned out to be high for this particular trajectory, you modify the probabilities, in such a way that all the actions you actually took in this trajectory become more likely, because apparently they were associated with a high reward in the end. So why not make them more likely? If you moved primarily up in your trajectory, and that gave you a high reward because the box happens to sit at the upper end of the picture, then you increase the probability to move up. That seems like a good strategy.

Now, if you think about it a little more: you will run many, many trajectories, many of these biased random walks, and some have high rewards and some have lower rewards. It will also occasionally happen that a trajectory gave you a high reward in the end, but some of the actions you took were actually a little bit stupid: you moved away temporarily, only to return later; you made a little loop; that's not smart. In this prescription you will also reinforce the probabilities of those bad actions, because they occurred together with a high reward. But don't worry: statistically speaking, the bad actions will occur more often in low-reward trajectories than in high-reward trajectories. So if you average over everything, you are still going in the right direction. In each particular example you may also reward bad actions, but averaged over everything it will be good.

Okay, that is in words what will happen; now we derive it mathematically. Here again is our little picture, an agent and the environment, but now, as we said, the trick is to introduce these action probabilities. These are conditional probabilities, because they depend on the current state: in one state I may have a larger probability to move up, in another state a larger probability to move to the left, and so on. The way people write this down is in terms of a probability distribution that defines the policy, and I cannot help it, but the name for these probabilities in this field is pi. So pi is not 3.141...; pi is a probability. pi(a|s) is the conditional probability of taking action a given the observed state s, where again s could just be your location specified in terms of coordinates, or a full picture of your environment; it can be anything. I put little indices t here, because this is at time t: at a later time I may have different states and therefore different actions.

And in order to be able to learn, I want these probabilities to be changeable. I could take each probability as its own variable; if there are 100 of them, then there are 100 numbers I can change, maybe via gradient
descent. But once the state space becomes really large, like when the state is an image, or the action space becomes very large, I do not want to write down an incredibly large table of action probabilities. Instead I want to parametrize them, maybe with the help of a neural network, and that's why we put the little subscript theta on the action probabilities: pi_theta(a|s). The action probabilities will eventually be parametrized in some way that I can choose; it is arbitrary, it need not even be a neural network, I could also just write down some ansatz by hand, but the network is the most general thing we can do. Eventually we will do gradient descent with respect to these parameters theta: as I change theta, my action probabilities change, and therefore the trajectories I generate will change too. Coming back to the robot, which basically does a random walk where I give a reward if it hits the right spot: it's really like learning the right probabilities for a random walk, those random walks that are most likely to generate high rewards, possibly in a very high-dimensional space. Is this clear? This is the central object of policy gradient: it is the policy, and it has parameters, so we can take gradients. It seems like this is accepted. As I just mentioned, we often parametrize the agent by a neural network, so the policy is really a neural network mapping observations to action probabilities. To make it concrete in the case of the robot: if the robot is currently at this position, it has a table of four action probabilities, for the actions down, up, left, and right. These are probabilities, they are normalized, and when the robot actually executes a trajectory, it picks the action according to these probabilities.

And here comes a very interesting question, coming back to where we started. We wanted to optimize the overall reward, and we are now on track to use gradient ascent; that seems like a good step. But here is the question: will we need a model of the environment? Will we have to be able to simulate the environment's evolution for a given action, in order to evaluate the gradient of the reward? You want to move in the direction of higher reward, and you have made sure that you can at least take gradients, because you parametrized your action probabilities by continuous things; that's good. But will we need to simulate the environment's evolution? At first it seems this might be the case, right? If I take this action, if I move up, I get some reward; if I move left, I get another reward. It seems that in order to predict what happens to the average reward when I change my action probabilities, I would need to know the effect of each action: if the action is "up", something happens in the environment; if the action is "left", something else happens. So it seems like I need a model of the environment, and that would be scary. In physics we sometimes do have a model of the environment: I may have the Schrödinger equation for a quantum system and can predict what will happen. But if you think of the board game, I certainly do not have a model of the environment, because the environment includes not only the board and the rules of the game, which are deterministic, but also the brain of the opponent, and I certainly cannot predict what the opponent will do if I take a certain action. So that still seems like a critical point.
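Before we press on, here is a minimal sketch (my own illustration, not code from the lecture) of what such a parametrized policy pi_theta(a|s) can look like in Python for the grid-world robot: the state is the robot's (x, y) coordinates, theta collects the weights of a tiny linear network, and a softmax turns the outputs into normalized action probabilities. All names and sizes here are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# theta: weights and bias of a tiny linear "network" (4 actions, 2 inputs)
theta = {"W": 0.01 * rng.standard_normal((4, 2)), "b": np.zeros(4)}

def policy(state, theta):
    """pi_theta(a | s): softmax over the four moves, given e.g. (x, y)."""
    logits = theta["W"] @ np.asarray(state, dtype=float) + theta["b"]
    logits -= logits.max()            # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()                # probabilities for (up, down, left, right)

probs = policy((3.0, 5.0), theta)     # four numbers, normalized to one
action = rng.choice(4, p=probs)       # sample the next action from the policy
```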
Okay, let's go ahead anyway. First we have to commit to some way of describing the environment, and the most general way, which also fits nicely with this action-probability business, is a probabilistic one. We say: if the environment is in some state s and I decide to take an action a, the way it jumps to a different state s' is not necessarily deterministic; maybe it's probabilistic. Maybe the environment does some funny random stuff even without any action; that could be true. So what people do is write down, at least in principle, as a mathematical object, the transition function of the environment, P(s'|s, a): given that I am currently in state s and my agent has decided to take action a, in which states s' can I end up, and with what probability? Again this sounds scary because, as I said, for the board game of Go this would mean: I observe the state of the board, I decide which action to take, and then this function would have to tell me what my opponent does next, so that I know the state of the board after the opponent's move. That doesn't seem like something I would know. Maybe it could be deduced after observing many, many runs of the game, learning what my opponent does, but it seems difficult. Nevertheless, let's move ahead, and let's keep our fingers crossed that somehow, in the end, this will not be a problem. This sometimes happens in mathematics: you manipulate objects that you know you will never be able to calculate or have access to, but they are still good conceptually.

Now we need to talk about trajectories. For me, a trajectory is just the sequence of states and the sequence of actions: I give you the initial state, then the agent takes an action, I end up in another state, which maybe follows deterministically, or maybe the environment makes a funny random transition, and so on. The entirety of all these states and actions is what I define as my trajectory, and whenever I play the game afresh, I get a different trajectory. I play the game of Go once, twice, a third time, and every time the game probably evolves differently, so every time I get a different trajectory. As an abbreviation we call the trajectory tau.

So now we can ask things like: what is the probability of a trajectory? Of course a trajectory is a very high-dimensional object, so this sounds even more scary than before, but let's move ahead. What is the probability of observing a certain trajectory tau when I play this game many, many times? Let's say we start in a definite initial state s_0 (we could also make the initial state probabilistic; it doesn't matter). The agent looks at this state s_0 and announces certain action probabilities; I take one of these actions according to those probabilities. Based on the current state and the current action, I know in principle that my environment makes a stochastic transition to a new state; that is the capital-P transition probability we just discussed. Now I am in the new state, the agent looks at it, decides on an action again, and the thing continues. So step by step I have these probabilistic transitions: half of the
probabilistic transitions are there because the agent decides on an action, and half are there because, based on that action and the current state, the environment also decides where to go next. The probability of going through a particular sequence can then simply be written as a product of probabilities, because these are all conditional: I start in a certain state; based on that state the agent decides on an action; based on that action and the state the environment decides on the next state; based on that the agent decides again; and so on. These are all conditional probabilities, and to get the full probability I just multiply them all. That's what I wrote down here: the product runs over all the time steps in the trajectory, and in this product I always take a pair of two probabilities, the action probability pi_theta(a_t|s_t) and the environment transition probability P(s_{t+1}|s_t, a_t). This long product string is not something we can hope to evaluate easily, or average over. But what we certainly can do is sample from it, and that will become important. If I start in a state and my agent announces the action probabilities, I can sample from those probabilities and commit to a certain action; then the environment does whatever it has to do, it samples from this capital-P transition probability, and I don't even need to know that probability; the environment does it for me. Then it's again the agent's turn to decide on the next action, and so on. So I can certainly sample one trajectory, and when I run the game afresh, I sample the next trajectory. Sampling is fine.

Now I have to introduce something else. We talked about the reward, and now we want to be a little more precise. Sometimes I only announce a reward at the very end: for example, plus one if I win the game, minus one if I lose, and zero otherwise. Sometimes I can also give rewards at intermediate stages: maybe my robot arm is able to move, but if it moves too fast it should be punished, because that is dangerous for the mechanics. So there may be intermediate rewards at intermediate times, not only at the very final time when I reach some end state. We call these rewards, whether intermediate or at the very end, r_t: for each time step t there can be a reward. The thing I want to optimize is of course the sum of all these rewards. That may consist of the reward at the very end, but also of rewards that punish, say, doing too many controlled-NOT gates in a quantum computer, because those gates are very faulty. We call this sum over all rewards the return; that's the language of the field. Capital R, the return, is just the sum of all the rewards, R = sum over t of r_t; sometimes people also call it the cumulative reward. This is the thing we want to optimize, and for each trajectory it can easily be calculated; it is designed by me, the user. I look at the end state and say, oh, that's pretty close to what I wanted, so you get a high reward; that is not difficult to evaluate for a single trajectory.
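To make the sampling point concrete, here is a sketch (my own, with a hypothetical gym-style environment interface: env.reset() returning a state, env.step(a) returning (state, reward, done)) of rolling out one trajectory and computing its return. The key point is that we only ever draw samples; the transition probabilities P(s'|s, a) never appear explicitly.

```python
import numpy as np

def sample_trajectory(env, policy_fn, theta, rng, max_steps=100):
    """Roll out one trajectory; only *samples* are needed, never P(s'|s,a)."""
    states, actions, rewards = [], [], []
    s = env.reset()                      # assumed gym-style interface
    for _ in range(max_steps):
        p = policy_fn(s, theta)          # pi_theta(. | s)
        a = rng.choice(len(p), p=p)      # the agent samples its action
        s_next, r, done = env.step(a)    # the environment samples s', pays r_t
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
        if done:
            break
    R = sum(rewards)                     # the return: sum of all the rewards
    return states, actions, rewards, R
```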
So what do we want to optimize? I gave you the expression for the return, the cumulative reward, but that is for a single trajectory. Of course I'm happy if I come out well in a single trajectory, but the next time I run the game maybe I come out badly. I want a strategy that on average gives me a high return, averaged over all the trajectories I run, because otherwise I'm just lucky. So let's write down the average return, averaged over all these trajectories we discussed, for which in principle we know the probabilities. The average return, which I will sometimes write with an overbar or as the expectation value E[R], is, as for any expectation value, the sum over all possibilities of the probability of that possibility times the return assigned to it, and the possibilities here are the trajectories: E[R] = sum over tau of P_theta(tau) R(tau). I sum over all trajectories, multiplying the probability of a trajectory by the return for that trajectory, and I get the average return. In principle this is the object we want to optimize, and as you can see from the formula, since the probability of a trajectory depends on the parameters theta with which we parametrized our action probabilities, the whole average depends on theta. So it makes sense to do gradient descent, or rather gradient ascent, because I want to increase the return. Of course no one can actually evaluate this sum precisely: remember, a trajectory is really a string of states and actions, and even with only two states and two actions, as the trajectory length grows I have two to the number of time steps, or four to the number of time steps, possible trajectories. No one can evaluate this, but we can still write it down. What we want to do is optimize this return by gradient ascent in the parameters theta. Formally it is just like cost-function minimization, except it is return-function maximization: we want to move along the gradient, and for that purpose we have to calculate the gradient. Any questions so far? We have set up everything, and now it's mathematics, and hoping for some good luck, so that in the end we do not need an explicit model of the environment. Exactly: theta are the parameters of the agent. In the more advanced cases this will indeed be a neural network; in simpler cases, where the state space and the action space are small, you could even write down all the probabilities in a little table, one entry for each state and action, and then the entries of the table are continuous numbers which you could take directly as your parameters to optimize.

Now, before we go to the solution, here is a little exercise. The first question is easy. Let's even forget where this came from in reinforcement learning; we just say we have a probability distribution over certain events j. This distribution p depends on some parameter theta, so I can change it, and I want to average some observable R: each event j has an associated R_j, and I want to calculate the average of R. The direct formula is written in the first line: E[R] = sum over j of p_theta(j) R_j. The question is: can we sample this via Monte Carlo? The immediate answer is yes. If I can sample from this distribution p(j), I just throw dice according to these probabilities, each time I note down the current value of R, I do this a hundred times, and
I simply take the empirical average over those 100 samples, and I get an approximation to the sum. That's what we could call Monte Carlo: I sample from the probability distribution, throw my dice many times, and average over the results.

But then there's the really interesting question: now I am interested in the gradient of this expectation value with respect to theta, the derivative of E[R] with respect to theta. Can we sample this via Monte Carlo too? Again I want you to discuss with your neighbor, please, and think hard about it. First you should discover that there is a problem, and then maybe you can even come up with a solution. Please discuss this for five minutes; this is the main trick of the whole strategy. Okay, maybe one more minute, and then I take suggestions or insights. Okay, good. So, is there any insight as to why this is a little more tricky? What's the problem? Can anyone volunteer? Is there a problem? Yes. Okay, let's assume that taking the derivative is not difficult and I do have an exact expression: sum over j of (d p_theta(j) / d theta) R_j. Why is this still hard? Why is it not immediately obvious how to sample it with Monte Carlo? Exactly: what I always need when I want to do Monte Carlo is something of the form "probability times something", an expression with a probability in front. Ah, okay, so your trick would have been to take this derivative and rewrite it in terms of a probability distribution. Unfortunately the derivative of a probability is sometimes positive and sometimes negative, so it is not itself a probability; there's already a problem. Any suggestions for how to get from this kind of expression to that kind of expression? "Do something by parts": that's a very good idea. "Multiply and divide": yes, that's basically the same idea, I guess. Multiply and divide. I want to force a probability to appear here, so I take the rest, divide by the probability and multiply by it, keep R, which was there anyway, and now I can interpret the combination as my question mark, so to speak:

d E[R] / d theta = sum over j of p_theta(j) [ (1 / p_theta(j)) d p_theta(j) / d theta ] R_j.

This I can sample, because it is in the usual form of an expectation value. So instead of evaluating the derivative sum directly, I sample the expression in brackets times R. And this bracket is actually a logarithmic derivative: (1/p) dp/dtheta = d(log p)/dtheta, so if you like you can write it in that form; it's the same thing, of course. So this is one crucial step here, and we will make use of it.
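To convince yourself that this multiply-and-divide trick really works, here is a tiny self-contained numerical check (my own illustration): a distribution over two events with p_theta(1) given by a sigmoid, where the exact derivative of the expectation value can be compared with the Monte Carlo estimate of E[R d log p / d theta].

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.3
R = np.array([0.5, 2.0])                    # rewards for the events j = 0, 1
p1 = 1.0 / (1.0 + np.exp(-theta))           # p_theta(j = 1), a sigmoid

# Exact derivative of E[R] = (1 - p1) R0 + p1 R1 with respect to theta:
exact = p1 * (1.0 - p1) * (R[1] - R[0])     # since dp1/dtheta = p1 (1 - p1)

# Monte Carlo: draw j ~ p_theta, average R(j) * d log p_theta(j) / d theta
j = (rng.random(200_000) < p1).astype(int)  # samples of the event
dlogp = np.where(j == 1, 1.0 - p1, -p1)     # log-derivative for each sample
estimate = np.mean(R[j] * dlogp)

print(exact, estimate)                      # agree up to sampling noise
```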
Let's come back to our gradient of the average return. What is shown here: I take the gradient, pull it into the sum over trajectories, and then apply exactly the trick that I just showed on the blackboard and that you discovered. Now I still have to evaluate the logarithmic derivative of the probability of a trajectory. The trajectory probability was a product of many conditional probabilities, and very fortunately the logarithm of a product is a sum of individual terms, which makes things much easier. If I take the gradient of that sum, only the terms that depend on theta contribute, and now you look at it and see: great, the environment transition probabilities do not depend on theta, so they drop out completely. The only terms that remain are the action probabilities that define the policy. So the logarithmic derivative we need is a sum over all time steps, generated from the product over time steps by the logarithm: the sum of the logarithmic derivatives of the action probabilities. These action probabilities I know, at least in principle: maybe I parametrized them myself, or maybe I wrote down a neural network, and the usual neural-network packages can give me the derivatives. So that's not so bad. What is really important is that the transition probabilities of the environment have disappeared from everywhere except the trajectory probability itself, and even that is not so bad, because remember, I am doing Monte Carlo sampling, and this happens automatically: I just go through my trajectories, the environment does its thing stochastically, it samples for me, and I don't even need to know its probabilities; the environment samples from its own distribution. So this is all contained in the Monte Carlo sampling. This is the magic: in the terms I need to evaluate, the environment does not appear anymore, and in the only term where it does appear, it is taken care of by Monte Carlo sampling.

So we have seen the three magic steps of policy gradient: first, the idea of turning a discrete problem into a continuous one by going from discrete actions to probabilities of actions; then the idea of how to Monte Carlo sample such a gradient, by the multiply-and-divide trick; and finally the observation that the quantity I am averaging does not depend on the environment transition probabilities, and the only place they enter is where I use Monte Carlo sampling anyway. Is this clear? This is basically already everything about policy gradient. Is there an open question here? It seems clear so far, but you can always ask later.

So this is the main formula of the policy gradient method. The derivative of my average return with respect to the parameters is

d E[R] / d theta = E[ R(tau) * sum over t of d log pi_theta(a_t|s_t) / d theta ],

where I sum over all times and take the expectation value of the return for a particular trajectory multiplied by the logarithmic derivative along that trajectory. I am emphasizing "particular trajectory": the a_t and s_t in here are the actions and states of a particular trajectory that I got when I ran my game, say, 500 times; one of these trajectories, at this point in time, has the action a_t and the state s_t, and I evaluate the gradient exactly at that spot. And then I can do stochastic gradient ascent (I should have written "ascent", because I want to go up in reward): just as you learned for neural networks, I have a learning rate eta and I walk up the gradient, improving my average return all the time. As usual, this expectation value should be taken over all trajectories, which I cannot do, so I Monte Carlo sample it: maybe I roll out 20 trajectories, take the empirical average over those 20 trajectories, and that is already good enough for one step of my gradient ascent.
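Putting the pieces together, here is a sketch of one such Monte Carlo update step (my own illustration, reusing the policy and sample_trajectory sketches from above). For a softmax policy with logits z = W s + b, the log-derivative has the standard form d log pi(a|s) / dz = onehot(a) - pi(.|s), which is what the code uses.

```python
import numpy as np

def reinforce_step(env, theta, rng, batch=20, lr=0.1):
    """One Monte Carlo policy-gradient step, averaged over a batch."""
    gW = np.zeros_like(theta["W"])
    gb = np.zeros_like(theta["b"])
    for _ in range(batch):
        states, actions, rewards, R = sample_trajectory(env, policy, theta, rng)
        for s, a in zip(states, actions):
            p = policy(s, theta)
            dz = -p
            dz[a] += 1.0                 # grad of log pi(a|s) w.r.t. the logits
            gb += R * dz                 # return times logarithmic derivative
            gW += R * np.outer(dz, np.asarray(s, dtype=float))
    theta["W"] += lr * gW / batch        # ascent: walk *up* the gradient
    theta["b"] += lr * gb / batch
```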
Now this is just a formula, but we can try to interpret it. What is really written here is the following: I have run my trajectory, I have noted down all the states and actions, and by moving theta in the direction of this logarithmic gradient I am actually increasing the probabilities of taking those particular actions that occurred in my particular trajectory. Whether I move along the gradient of the probability or of its logarithm doesn't matter for the direction, since the logarithm is monotonic: I will increase these action probabilities. So it is exactly as I said in the beginning: if you have a high-return trajectory, where capital R is large and positive, you move in the direction of increasing all the probabilities of the actions you actually took. Now you may wonder: what if R is always positive? Don't we then always increase the action probabilities? But the point is simply that these probabilities are normalized: if I increase one and don't increase another as much, the other automatically gets suppressed. So it all works out in the right way: those action probabilities win that are statistically associated with the highest return; they simply blow up the fastest, so to speak. Okay, so that's policy gradient.

Now here is a little side remark: you can actually simplify this a bit, and sometimes it helps. Instead of multiplying with the total cumulative reward, the total return, you can also take the return counted only from the given time step. I have written the formula here: R_t sums the immediate rewards only from t onward, not the earlier ones; capital R itself included everything, R_t only the rewards starting at t. Why does that still work and give the same thing? If you think about it, the rewards at earlier times are not influenced at all by the action I take at time t. So to decide how to move the action probability for the action at time t, I do not need to know the earlier rewards: even if I take a completely different action at time t in the next trajectory, I would still have gotten the same rewards at earlier times. The action at time t does not modify rewards at earlier times; it's a kind of causality. Why does this slightly modified expression sometimes help? It certainly makes no difference if the reward only comes at the very end of the game, because then R_t is the same as R and everything is the same anyway. But if I do have a sequence of little rewards accumulated over time, then keeping only the returns from the given point in time to the end gives a quantity with smaller fluctuations, and that is often helpful. The expectation value is the same; in any case I am going in the right direction; but it is good to have a smaller variance, because remember, I am estimating it from a finite number of trajectories.

And one final remark in this direction: there is also a thing people call discounting. You can say: I am most interested in optimizing the immediate reward at this time step, already a little less interested in what happens one time step later, and even less in the step after that. So people rewrite the future return R_t with a discount factor: R_t = sum over t' >= t of gamma^(t'-t) r_{t'}. At t' = t the factor is gamma to the zero, which is one; in the next time step it is gamma, then gamma squared, gamma cubed; and if gamma is a number less than one, it suppresses the influence of rewards that happen further in the future.
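Both tweaks, reward-to-go and discounting, amount to one small change in what multiplies the logarithmic derivative at step t. A minimal sketch (my own illustration):

```python
import numpy as np

def returns_to_go(rewards, gamma=1.0):
    """R_t = sum_{t' >= t} gamma^(t'-t) r_{t'}; gamma = 1 means no discount."""
    R_t = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # R_t = r_t + gamma * R_{t+1}
        R_t[t] = running
    return R_t   # use R_t[t] in place of the total return R at time step t

print(returns_to_go([0.0, 1.0, 0.0, 2.0], gamma=0.9))
```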
Now, if you think about it: if gamma is really small, then you are only trying to optimize the immediate reward and nothing else; you are very short-sighted in your actions. Instead of trying to win the game, you just want to get the next cookie, so to speak. This is officially known as a greedy strategy, and of course we know that greedy strategies are bad: you may be too greedy, get a little reward now, and miss a lot of reward later on. So it's clear that the correct solution would be the non-greedy one, gamma equal to one, where I don't discount. But sometimes learning is more stable, with smaller fluctuations of the gradient signal, if you do discount, so in practice people use something intermediate, and maybe they even change the discount factor over time. Okay, any questions about any of this? So in principle, at least on paper, we have solved all the problems we identified in the beginning; we still have to see how well it works. We turned discrete actions into continuous probabilities, and we do not actually need a model of the environment. So now we can let our little robot run around: I initialize some action probabilities that depend on the current location and tell me whether to move up or left or right; I can initialize them randomly; I run one trajectory and note down the reward; I run the next trajectory; I run 20 trajectories; I update my probabilities according to the formula we just saw, reinforcing those actions that were statistically correlated with high-reward trajectories; and then with the new probabilities I run trajectories again, and so on.

So now we want to see this in practice, and instead of going to a complicated example: the good thing to do when you enter a new topic and want to learn about it is to find the simplest possible toy models you can think of that are better than nothing, like the harmonic oscillator, the two-level system, or the particle in a box if we think of quantum mechanics. So the challenge for you, again for the next five minutes: can you think up a super simple toy example for RL, maybe formulated the way we described it, with states and actions and so on? What would be your choice? There is no single correct answer; I will present what I call a very simple toy model, but I am really interested in what you come up with. So, the next five minutes: invent your simplest possible RL toy example. Okay, maybe one more minute, and then I will call for suggestions. Okay, are there any suggestions? A really simple RL example, or maybe not so simple, but what did you come up with? Okay, yes: finding the ground state of some quantum model, but classical; some Hamiltonian of the form H(s) = sum over i, j of J_ij s_i s_j, or so. And how would you do that? By how the energy changes; so the reward at time t is probably somehow the energy change. Okay, good, yes. That could be something: it's an optimization task split up into different little actions, and I am probably observing the whole spin configuration. I wouldn't consider it the simplest possible model, but yes, I actually like it.
Tic-tac-toe: yes, a game, okay, good. That's interesting; that's already a little board game, so to speak, with the reward, I guess, depending on how you do in the end, whether you win. And there the state space is small enough that I can probably still store things in a table. Any other suggestions? There are no right or wrong answers. Okay, so that's the robot in the box world: you get a reward when you reach the box. But what would be the observation in your case? Is the box placed randomly, for example? That's one of the interesting questions. Okay, so the robot is placed randomly; and what does the robot know? Does it know its coordinates, so to speak, or not? Yes, it knows its coordinates, and it has to learn: if I am at these coordinates, to the left of the box, then I should move to the right, I guess. Okay.

So I will give you an even simpler example, which I claim is the simplest reinforcement learning example ever. I have a random walk, and the only action I care about is moving up or down. So the only probability I can change is really the probability of moving up, because the other one is one minus that. And my return is simply where I end up in the end: how large the coordinate is that I reach at the end. You can tell me immediately what the optimal strategy is: always go up; the probability of going up should be one. So we already know the solution, but we want to see how policy gradient arrives at it. I should say this is really simple because, as you notice, the state, the observation, doesn't even enter the picture: we don't even need to tell the agent where it is; there is only the action probability. So it's super simple (we will come to a more interesting example later), but it's already interesting enough.

The first question is: how would you parametrize the policy? The policy here is just moving up or down, and since one probability is one minus the other, it's really about parametrizing a single probability. How would you do that? One choice is to say this probability itself is my theta; you could always do that. The problem is that theta then ranges between zero and one and maps directly onto the probability, so it can easily happen that in your gradient ascent you step out of the allowed range of values for theta, and then you have to cut it off, and then it's numerically shaky; people don't like that. It would be better to have a parametrization of the probability that automatically stays between zero and one, and there are many choices. Any suggestion? Yes: the sigmoid, exactly, because that's much nicer. I have a theta, and I want a function of theta that stays between zero and one, like this. So the probability I am talking about, pi_theta(up), is just 1 / (1 + e^(-theta)), the usual sigmoid function. If theta is large, the e^(-theta) becomes zero and the probability is one: I always move up. If theta is very negative, the probability goes to zero: I move down. So this is what we can do. Here is the policy: the probability of moving up is the sigmoid. The return, as I said, is just how far you got at the end of the trajectory; let's say we have capital T time steps. And the reinforcement learning update I have just written down here.
The update is again the logarithmic derivative of the action probabilities for a particular trajectory, multiplied by the return of that trajectory, summed over all times and averaged over all trajectories. And now we can actually calculate things; that's the beauty of this example, everything can be done analytically, which is why I introduce it. For example, the logarithmic derivative is a very short calculation, and you can express the result again in terms of the probability: it is 1 - p_up if the action was "up", and -p_up if the action was "down". Then you can actually carry out the sum over all times of these logarithmic derivatives for a particular trajectory. The trajectory, remember, has actions plus one, plus one, minus one, plus one, and so on: a particular sequence of up and down steps. If you work it all out, using the formulas above, you get a definite expression: the sum over all times of the logarithmic derivatives is N_up - N p_up, the number of up steps you actually took in this particular trajectory, minus the total number of time steps times the probability of going up. N times p_up is of course the average number of up steps, while N_up is the actual number of up steps in this particular trajectory, which fluctuates from trajectory to trajectory. So in a sense this expression measures how much more I went up in this particular trajectory than on average, and remember, this is the thing that multiplies the return.

That leads to a very interesting observation. If we write down the actual update of the theta parameter, it is the expectation value, over a batch of trajectories, of the return R times this difference we just discussed, the actual number of up steps minus the average number of up steps: delta theta proportional to E[ R (N_up - N p_up) ]. My question for you: what happens to the up probability if the trajectories with more up steps than average get a higher return? That will actually be the case here; we know this already, because our return is constructed so that we prefer trajectories that primarily go up. So what happens to theta: is this positive or negative? It's pretty easy to decide: it's positive, absolutely, because trajectories where N_up minus the average is positive go along with the high rewards, so this delta theta, even when averaged over many trajectories, will be positive. And remember, going up in theta means increasing the probability to move up, which is exactly what we want. So what happens is: we roll out many trajectories, those trajectories that just by random fluctuation have a few more up steps get rewarded a little more, and because of that the probability to go up increases.

Then we can, for example, calculate the average RL update: if I really average over, let's say, millions of trajectories, we can write down an explicit analytical expression for this averaged update.
What we find is that, for example, if I specialize to the case where the action probabilities are still 50/50, I can evaluate it explicitly, and yes, of course I am moving up, just as we discussed: I am increasing my probabilities. Here is the general calculation (this would be a little homework exercise): the average step in theta in my gradient ascent is always positive, I always increase the action probability to go up, but it depends on the current action probability. I am plotting here the outcome of this calculation, the average update as a function of the probability to go up, and you see I get the biggest changes when I am still near 50/50, and much smaller steps in theta, much smaller updates, when I am already close to the good solution, which is up probability equal to one. Unfortunately, I also get very small changes when I am close to the bad solution, and the reason is quite simple to understand. If you only ever see trajectories that go up, you don't need to move anyway; but if you only ever see trajectories that go down, you never once encounter a trajectory that teaches you what would have happened if you had gone up. You only see the bad stuff, so you cannot even compare. It's best to be in the middle, at 50/50, because then you see a large variety of different trajectories, and you can very definitely say: going up, that really helps. That's what is nice about this example: you can interpret all these things.

And then you can really run it. The calculation we just did was the average update to theta, to the probability; but here I have really run the Monte Carlo simulation. I have run trajectories, updated my theta according to the reinforcement learning policy-gradient update rule, and plotted how the probability of going up changes as a function of training time, the number of trajectories I went through. I always started at 50/50, a completely unbiased random walk, and you see that the probability moves up, as it should, and eventually ends up at the right fixed point: always move up. But what you also see in this example is that the fluctuations during learning can be pretty large, at least for the parameters I chose here. It even happens in this green run that I have already almost learned the correct strategy, and then, whoops, there is a relatively catastrophic collapse back to smaller probabilities, and then I slowly climb up again. So we can have strong fluctuations; that's an extreme example, of course, but it teaches us something.
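Since everything in this toy model fits in a few lines, here is a minimal sketch of exactly this simulation, using the analytic expression N_up - T p_up for the summed logarithmic derivative derived above. The hyperparameters are my own illustrative choices, not necessarily the ones behind the plot.

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 0.0                      # start at p_up = sigmoid(0) = 0.5, unbiased
T, batch, lr = 100, 20, 0.001    # illustrative choices

for update in range(2000):
    p_up = 1.0 / (1.0 + np.exp(-theta))
    g = 0.0
    for _ in range(batch):
        n_up = int((rng.random(T) < p_up).sum())  # up steps in one trajectory
        R = 2 * n_up - T                          # return = final coordinate
        g += R * (n_up - T * p_up)                # R * sum_t dlog pi / dtheta
    theta += lr * g / batch                       # gradient ascent on theta

print(1.0 / (1.0 + np.exp(-theta)))   # the up probability, should approach 1
```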
Of course, you can always suppress these fluctuations by running more trajectories and averaging over a batch: if you run 100 trajectories and average the gradient, you automatically do better, by a factor of one over the square root of the number of trajectories in the batch. That is not very surprising. But I also want to mention already that there are tricks you can play, and there is a very interesting one that I will discuss in the last three minutes or so. You can argue as follows. This return is a little bit funny, right? I want to maximize the return, so if you shift the return by a constant, it should not matter. If one person gives me a return that ranges between five and seven, and another person subtracts three from each of these numbers, the maximum is still in the same place, and my policy gradient should somehow evolve in the same way. So I can shift my return around, and the first thing to realize is that I can do so without any penalty: the average update is exactly the same. This is what I have tried to write down here: I subtract from my return some constant b. Here b really is a constant: R depends on the trajectory, b does not. The claim is that the average update is the same whatever I choose for b, as long as b is a constant. The proof is short: if b is constant, the extra term is just b times the expectation value of the logarithmic gradient (I have written it down here again), and if you write out this sum over all trajectories of the probability times the logarithmic gradient, you can basically undo the transformation we did earlier and recognize it as the gradient of the normalization sum. But the normalization sum is always one, so its gradient is zero. That is a little mathematical detail, but it confirms that, yes, as expected, subtracting a constant from the return changes nothing. Well, that is not quite true: what does not change is the expectation value; what can change is the variance. So this is one of those tricks where you keep the expectation value the same (on average you still move in the same direction, which is good; otherwise you would be doing something wrong) while reducing the variance: if you pick the right b, you can actually reduce the variance. You can even find the optimal b: look at these formulas, compute the variance of the whole expression, and minimize it with respect to b; that gives an expression for the optimal choice of b. I will not present it here. If you are really interested in learning more about reinforcement learning, that would be a little homework: take the variance and work out how b should look; as a hint, expressions with the squared logarithmic derivative should appear. I will not show it here because, while the concept of subtracting such a constant is very important and we will come back to it tomorrow, the particular solution here is not the one that is actually used in practice. But it really can help you: this optimal baseline, when I do the subtraction trick, suppresses the variance down to the red curve, which is not only smaller in this picture but also scales differently with the number of time steps. So that really helps. Yes, please? [Audience question.] Yeah, then really nothing happens, because that just multiplies everything; maybe I would then take smaller steps in theta, but that is like changing my learning rate, so it would not help me much.
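Here is a sketch of the baseline trick in the same toy setting; subtracting the batch-mean return, as done here, is a common simple choice of b, not the variance-optimal baseline alluded to above, and the batch size and other parameters are again placeholders.

import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
rng = np.random.default_rng(0)

theta, eta, n_steps, batch = 0.0, 0.001, 100, 32
for update in range(300):
    p_up = sigmoid(theta)
    # a whole batch of trajectories at once: shape (batch, n_steps)
    acts = rng.choice([+1, -1], size=(batch, n_steps), p=[p_up, 1 - p_up])
    R = acts.sum(axis=1)                                        # one return per trajectory
    g = np.count_nonzero(acts == +1, axis=1) - n_steps * p_up   # sum of log-derivatives per trajectory
    b = R.mean()                      # baseline: batch-mean return (simple, not the optimal b)
    theta += eta * np.mean((R - b) * g)   # same average direction, reduced variance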
Okay, now I have to think; I still wanted to give people a homework, so I should still explain it. So, homework: here is the second simplest RL example ever. Now we want a situation where there is an actual state, an actual observation coming in, so that the probabilities depend on something. Let me define the little problem. Again I have a random walker; in this case it can move up or stay on the same site, so the actions are plus one or zero. But now there is a particular target site, chosen at random in each run of the game, and the walker is rewarded according to the number of time steps it remains on this target site. You can already guess the best strategy: move as quickly as possible to the target site and then stay there. For the walker to have a chance of doing this, it should be told whether it is on the target site, so the observed state should be, say, zero or one, depending on whether it is on the target site or not. The action probabilities for moving or staying then also depend on the state, so we have two different actions and two different states that can be observed. The homework I suggest: try to implement the reinforcement learning update that we discussed for this kind of example. First you parametrize your action probabilities, maybe in a similar way as here, only there will be more parameters because the probabilities depend on the state; then you run trajectories, with a random number generator drawing actions according to these parametrized probabilities, and you do the update according to the RL rule that you learned. That is a little Python program that you can write, and you can team up, I hope. A sketch of a possible starting point follows below.

[Audience question.] But there is still a point here: the agent needs to know whether it is at the target site, because otherwise it never has a chance to do the right thing; it would always have to move around randomly, or maybe deterministically. Since the target site changes randomly from run to run of the game, the agent needs some indication of whether it is on the target site. If the target site were fixed from game to game, it could work, but only if you give another kind of state to the walker: for example, if it always starts from the same position, it should be told the number of time steps that have elapsed, because internally it would have to count: one, two, three, now I should stop. So whatever game you define, you need to think a little about what is the minimum observation this poor agent needs in order to actually solve the task.

[Audience question.] I see, okay, that is an elaborate scheme, but it is much more complicated. Basically, if someone only tells you afterwards, okay, you scored plus five, then in order to know what you should do next time you need a memory of the trajectory you actually took; then you can start to guess: I got five time steps on the correct target, let me look at my trajectory, ah, there was this period where I remained stationary for five steps. You would need to sit down and think about it, but I believe it is a much more complicated version of the game. Okay, I hope otherwise that is clear. It is a doable homework, and it is really fun: team up with someone and try to program it in Python.
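Since the exact setup (number of sites, episode length, boundary conditions) is left open, here is one possible scaffold in Python for this homework; all of these choices, including the wrap-around world and the starting position, are my assumptions that you can change, and the policy-gradient update itself is deliberately left as the exercise.

import numpy as np

rng = np.random.default_rng()
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

n_sites, n_steps = 10, 50      # assumed sizes; adjust freely
theta = np.zeros(2)            # one parameter per observed state: 0 = off target, 1 = on target

def run_episode(theta):
    """Roll out one game: the target site is re-drawn at random every run."""
    target = rng.integers(n_sites)
    x, R = 0, 0                           # walker starts at site 0 (an assumption)
    states, actions = [], []
    for _ in range(n_steps):
        s = int(x == target)              # minimal observation: am I on the target site?
        p_move = sigmoid(theta[s])        # state-dependent probability to move up
        a = int(rng.random() < p_move)    # action: 1 = move up, 0 = stay
        states.append(s)
        actions.append(a)
        x = (x + a) % n_sites             # wrap-around world (an assumption)
        R += int(x == target)             # reward: one point per time step on the target
    return np.array(states), np.array(actions), R

# Homework: for each visited state s, accumulate the log-derivative a - sigmoid(theta[s])
# over the trajectory and update theta[s] += eta * R * (that sum), as in the example above.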
And then we meet again tomorrow. If there are one or two urgent questions, we might still accommodate them now; otherwise I would suggest meeting tomorrow morning, since I think it is again your turn in the morning. Have a nice evening, and have fun programming your walker!