which we will discuss in a moment, but let me just first start with a recap of what we did yesterday. I introduced you to one of the main techniques for reinforcement learning, which is a set of techniques in computer science and machine learning for learning strategies. And that main technique which we discussed yesterday, today we will get to learn a few others, is called policy gradient. And as the name implies, you're looking at a policy, that is a strategy, and you're encoding it in such a way that you can actually take gradients through the policy. So the policy is really a mapping from the observed state of the environment, of the world around you, to an action. But it's formulated in terms of a conditional probability, so the probability of taking a certain action given a state. And then if you parametrize this set of probabilities, you automatically get something through which you can take gradients. So that is the setup of policy gradient. And more generally the setup of reinforcement learning is having this agent interact with an environment, and it's the policy of the agent that you want to improve over time. Okay, so now let me move forward to where we left off yesterday. So this is the second simplest reinforcement learning example that I introduced to you just before we finished. Remember the first example was simply a random walker where the reward was based on how far it gets, and then of course the solution is that it should have a probability of one of moving in the correct direction. That was still a little bit insufficient to cover the whole domain of reinforcement learning, because in that case the action that the random walker took was not even dependent on its position; it just had to decide whether to move up or down. Here we have something more interesting that I introduced to you. So here we have again a random walker. In this case it has a probability of either moving up or staying put, and it will be asked to stay as long as possible on a certain target site. So in order to be able to do that it somehow has to have information about the position of the target site. Otherwise, if it's completely blind, it will never be able to solve the task. Unless you were to put the target site always at the same spot, then it could count the number of steps. But let's say the target site is chosen randomly in each run of the game. So it needs some kind of observation. You could think of many possible kinds of observations, but the observation you might choose is that it has a local sensor that is able to detect whether it is currently on the target site or not. So the state space, the observation space, is just zero or one, depending on whether it is on the target site or not. And the action space also consists of two actions, whether to move or to stay. So here, for the first time, we do have a policy that depends on the state. And then you can apply the standard policy gradient reinforcement learning approach that we learned about, with a reward that is given by the number of time steps that it did stay on the target site. So who tried to solve it? Okay, so at least a few people tried to solve it. If you didn't, feel free to try it over the weekend or next week during the workshop, because it's really nice to see, in one simple case where you don't even need a neural network and don't need anything fancy, that this actually works and how it works. But I will show you some of the numerical results.
So what you see here is the progress at different stages of the training. We are always looking at a kind of space-time diagram: space runs on the vertical axis, time runs on the horizontal axis. And in each of these diagrams, I should say, I have tested an agent that was trained for a certain amount of time; to the left it had not been trained at all. I tested it by running several trajectories. Remember, the trajectories are probabilistic. And I also made sure that for these test cases I always put the target site at the same location, so that it becomes visually a little bit easier to identify what's going on. But of course during training, and also during real testing, the target site will be chosen randomly, otherwise it would be too simple. So what you can see in the leftmost picture, which is basically the status near the very beginning of training, is simply a random walk with drift. It's going upward, sometimes faster, sometimes slower, but it doesn't really know about a target site. Here the location of the target site is actually indicated by a dashed line. What happens after a little bit of training is that sometimes it already latches onto this target site. So you see here clearly there is a tendency for it to stay there for some time and then maybe to move on. And as training progresses further, here in the plots to the right, it really knows exactly what to do: to move as fast as possible initially, and then once it hits the target site, to stay. So apparently, yes, it has learned how to do the right thing, and the only signal that it used was this reward signal that, at the end of each trajectory, told it a number, say plus five if it was staying for five time units on the target site. But of course this is only a number. It does some funny random stuff, it gets announced this number, it does some random funny stuff again, and it gets announced another number. So imagine you would be doing this as a human. You would also need quite a few attempts to actually figure out what's going on, because no one is teaching you: oh, you got this reward of plus five because in this time interval you stayed on the correct target site. You realize that this information is missing; it's only a number, this reward. So you have to try many, many times, and in this case you're using policy gradient: in all the cases where you get a relatively high reward you say, okay, obviously I did something right, so let me reinforce, increase, the probabilities of those actions that I took in this trajectory. Which may have been: whenever I got a signal of zero from my sensor, which means not on the target site, I had a relatively high probability of moving up, so let me reinforce this; and whenever I got a signal of one, which of course we know means I am on the target site, I happened to have a large probability of staying, and so I should really reinforce this. Okay, so this is how it goes.
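To make this concrete, here is a minimal sketch of what such a training loop might look like for the target walker, without any neural network. Everything specific here (track length, episode length, learning rate, and the fact that it updates after every single trajectory instead of a batch) is just an illustrative assumption; the policy is parametrized by one sigmoid parameter per observed state, as described above.

```python
import numpy as np

# Toy setup (all numbers are assumptions): a 1D track of length L,
# episodes of T steps, reward = number of steps spent on the target site.
L, T = 50, 200
eta, n_episodes = 0.001, 2000
rng = np.random.default_rng(0)

# One parameter per observed state s in {0: not on target, 1: on target};
# P(move | s) = sigmoid(theta[s]), P(stay | s) = 1 - P(move | s).
theta = np.zeros(2)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

for episode in range(n_episodes):
    target = rng.integers(1, L)          # target site chosen randomly each run
    x, R = 0, 0.0
    grad = np.zeros(2)                   # accumulates sum_t d/dtheta log pi(a_t|s_t)
    for t in range(T):
        s = int(x == target)             # local sensor: am I on the target site?
        p_move = sigmoid(theta[s])
        move = rng.random() < p_move     # sample the action from the policy
        # d/dtheta log pi: (1 - p_move) if we moved, -p_move if we stayed
        grad[s] += (1.0 - p_move) if move else -p_move
        if move:
            x = min(x + 1, L - 1)
        R += float(x == target)          # reward: one point per step on the target
    theta += eta * R * grad              # policy gradient (R times grad log pi) update

print("P(move | not on target) =", sigmoid(theta[0]))
print("P(stay | on target)     =", 1.0 - sigmoid(theta[1]))
```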
You can also plot the evolution of the policy, visualized in terms of these probabilities. Remember, the state space has two states, and for each state I also have two actions, but since the probabilities for the two different actions are normalized I only need to plot one, so in total I need to plot one probability for each state. On the vertical axis I put the probability for staying put (action zero) given that my sensor gives me signal one, which means I'm on the target site; we know this should ideally be large. And on the horizontal axis I'm plotting the probability of moving given that my sensor tells me zero, so I'm not on the target site; again, we know this should be large. And so wherever you start, you can start at random places, initially you will, by this policy gradient procedure, move in a kind of random walk in parameter space, because all of this is stochastic: I'm running a batch of trajectories, getting random rewards, so nothing is deterministic. But there is an unmistakable drift, and so eventually all these drifting training trajectories converge to the right spot, which is the fixed point: I'm staying put when I'm on the target site and I'm moving otherwise. So you see this actually works. Any questions about this example? And if you implemented it, play around: if, say, the batch size of trajectories that you're training on were smaller, or the learning rate larger, then these training trajectories in parameter space would look much more noisy, but the training could potentially also be faster. So you can play around. Okay, so there is a general question that we can now ask: how should we parameterize the action probabilities? I gave you the hint yesterday that in this particular case you just proceed like the ansatz with the sigmoid that we introduced already for the simple walker, only now you have of course two different states, so you will have two different parameters theta. But in general, what would you do? In general your state space may be much larger, and also your action space can be significantly larger, and then you can use all the techniques that you learned in the beginning of this week. Namely, you can use a neural network where the input would be the state and the output, at the different output neurons, would be the different probabilities for the different actions. I have drawn it again for this very simple example, where the input was just: are we on the target, zero or one, so it would be a single neuron. But this input could now be an image, maybe the image from the video camera of your robot, so then it would have many more input neurons. Then you have a few hidden layers, here only one, and then you have the output, which is to stay or to move, and the values of these output neurons give you the policy, so the probability of an action given a state. It's as simple as that, and in this case the thetas are not parameters that you write into your ansatz, but simply the weights and biases of the neural network. And now, since the outputs represent the probabilities of different actions, they have to sum up to one, they have to be normalized. And you already learned, I think from Ilishka, how you would do that: you would use a softmax activation for the very last layer, which is exactly of this type, it makes sure that you can interpret the last activations as probabilities, because they will be between zero and one and they will be normalized. Okay, so that's a powerful technique.
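As an illustration, such a policy network might look as follows in Keras (the layer sizes and the single hidden layer are just assumptions for this sketch); the essential point is the softmax on the last layer, which turns the outputs into normalized action probabilities.

```python
import tensorflow as tf

# Minimal policy network: input = observed state, output = pi(a|s) as a softmax.
state_size, num_actions = 1, 2       # placeholder sizes for the target-walker example

policy_net = tf.keras.Sequential([
    tf.keras.Input(shape=(state_size,)),
    tf.keras.layers.Dense(32, activation="relu"),               # one hidden layer
    tf.keras.layers.Dense(num_actions, activation="softmax"),   # normalized action probabilities
])

# pi(a|s) for a batch of states: each row sums to one and can be sampled from.
probs = policy_net(tf.constant([[0.0], [1.0]]))
```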
So suddenly you are parametrizing the policy in this very general, arbitrary way, and if the input were an image you could obviously use things like convolutional layers. So let me summarize how this policy gradient works if we want to run it including the neural network and so on. First let's talk about obtaining one trajectory. What are the steps? Well, as you go through the trajectory, you will execute an action that you've selected previously and you will record the new state. This is typically done in a computer program by calling a function that represents the environment. So you tell this function: please, I want to do this action, I want to move up or left, for example, and then this function, which internally has the logic of the game embedded, will return to you the new state, which might be the observed image, or only the zero or one in our case. So there will always be this environment, and sometimes it's you who programs the environment; then you can do tricks like vectorizing it, for example if it's a physics simulation, to speed it up. But sometimes this is also a black box. It could be a black box because it's another computer program to which you have no access except through this interface; initially they were trying out these things on video games, so maybe you don't have the code of the video game directly available, or maybe this is literally the controller of an actual robot moving around in the real world. Okay, so you now got your new state. You will now feed this new observed state into your neural network to get the action probabilities for the next action, and then you will sample from these action probabilities, so you will pick the different actions according to this probability distribution, so as to obtain the action for the next step. And then you go on and you go on and you go on until you reach the end of the trajectory, and that is either defined by saying, oh, we have a limited amount of time, or maybe we reach some goal up to a certain precision; that depends on you, how to define it. Okay, and then at the end of the trajectory you will actually get a reward. So now we have done one trajectory; in reality it will maybe even be a batch of trajectories, depending on how you implement it. You will obtain the overall sum of rewards, that is the return, for each of these trajectories, and now you can apply the reinforcement learning policy gradient technique. So you will look at the actions that have been taken, and also the corresponding states, in all of these trajectories, and you will enhance the probabilities by this grad-of-log-of-the-probability formula, with a special emphasis on the high-return trajectories, because it was the return times the grad of the log of the probability in which you are moving your theta parameters. Okay, and so that is the whole pipeline.
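Here is a sketch of what such a rollout loop could look like, assuming a Gym-style environment object with reset() and step() methods; that interface, and the helper name run_trajectory, are assumptions of this sketch, not a fixed API. It uses the policy network from the previous snippet.

```python
import numpy as np

def run_trajectory(env, policy_net, max_steps=200):
    """Roll out one trajectory: sample actions from the policy, step the environment,
    and record states, actions, and the total return. Assumes a classic Gym-style
    environment: reset() -> state, step(a) -> (state, reward, done, info)."""
    states, actions, rewards = [], [], []
    state = env.reset()
    for t in range(max_steps):
        obs = np.asarray(state, dtype=np.float32).reshape(1, -1)   # batch of one state
        probs = policy_net(obs).numpy()[0]                         # pi(a|s) from the network
        action = np.random.choice(len(probs), p=probs)             # sample the next action
        next_state, reward, done, _ = env.step(action)             # environment applies the game logic
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state
        if done:
            break
    return np.array(states), np.array(actions), float(np.sum(rewards))   # return R = sum of rewards
```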
But how would you actually implement this? Here's a little trick. Remember, we said we have the neural network into which you feed the state to get the action probabilities, using softmax for example, and now you want to implement this reinforcement learning policy gradient update, the R grad log P, but ideally you want to phrase it in terms of things that you already know from neural network supervised learning. And so what you can use is a trick, namely the categorical cross-entropy, which I think you did come across, right? If you want to categorize images, for example, the neural network will also output things that you can interpret as probabilities, the probability that this image is a cat or that this image is a dog. And to compare against the correct solution you will use a cost function that looks like this, this categorical cross-entropy, which is always there to compare some desired probability distribution against the actual probability distribution that your neural network produces for you. And so we will do the same here, except for each time step in each trajectory, as the desired probability distribution in this categorical cross-entropy, we will use a probability distribution that is very simple: it only has a non-zero entry for the action that was actually taken in this particular trajectory, because we want to reinforce that particular action, and we want to reinforce it particularly if it had a high reward. And we can even do this by, well, cheating a little bit. We could adapt the learning rate or something, but we can actually cheat: for the desired distribution, at the location of the action that was taken at this time step in this trajectory, we can put capital R, the total return for this trajectory. This is not normalized, but somehow all the machine learning frameworks don't check this and they don't worry about it. So P of A will be R for the action that was taken and zero for all the other actions, and then you have the log of the neural network expression for A given S, where S is the state that was present at this point in time and again A is the action that was taken. And if you do it like this, implement this cost function, and then use your usual neural network routines to minimize it, you will get the right result: the neural network optimizer will take the gradient with respect to theta of this expression, and if you work it out and think about it, this is exactly the policy gradient reinforcement learning update. And you can use the batch averages and everything that you have, and depending on how you implement it you can use it in a very smart way. You can say: I take all the time steps of all the trajectories that I'm currently considering, and these can be many, and I put them in one giant batch. In this batch there will always be pairs of states and actions, and the state is, so to speak, the input, and the action, or rather this P of A probability distribution that has an R in the right spot, will be the desired output of the neural network. And then it becomes like a supervised learning problem: a large set of states and a large set of desired probability distributions constructed in this way, and you take the gradient of this cost function. So that works. To make it really clear, I put it as an example here. I don't know, you are all using different machine learning frameworks; for a long time we had been using Keras with TensorFlow, but it works the same in all the frameworks. So what you would have is a large array of inputs, the states, and this would have a size N times the state size. The state size depends on how large your state is, so 784 if it's a 28 by 28 pixel image, for example, but N would really be the number of state-action combinations that you consider, so it could be the number of trajectories in the batch multiplied by the number of time steps for each of these trajectories. So you have this gigantic array for all the inputs, representing all the time steps, and then you also have a big array of desired outputs, which again is an array of size N, which now becomes the batch size, times the number of actions, because that's the size of the output of the neural network. And if you then say: train on this batch with the categorical cross-entropy that you specified earlier, you get exactly the right thing.
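A sketch of this trick, again with Keras-style calls as an illustration: the targets are the "cheated" distributions with the return R at the position of the action that was actually taken, and a single supervised training step on the giant batch then reproduces the policy gradient update. The helper name and the array layout are just assumptions; policy_net is the network from the sketch above.

```python
import numpy as np
import tensorflow as tf

# Compile the policy network with the categorical cross-entropy,
#   loss = - sum_a P_desired(a) * log pi_theta(a|s);
# with the unnormalized targets below, its gradient is exactly -R * grad log pi.
policy_net.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                   loss="categorical_crossentropy")

def policy_gradient_update(all_states, all_actions, all_returns, num_actions):
    # all_states:  (N, state_size), one row per time step of every trajectory in the batch
    # all_actions: (N,) the action actually taken at that time step
    # all_returns: (N,) the total return R of the trajectory that time step belongs to
    N = len(all_actions)
    targets = np.zeros((N, num_actions), dtype=np.float32)
    targets[np.arange(N), all_actions] = all_returns   # R at the taken action, zero elsewhere
    # One supervised-looking training step on the whole giant batch:
    policy_net.train_on_batch(all_states, targets)
```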
Okay, so are there questions about these technicalities? They are very convenient, and you see how it all can be traced back to supervised learning. Apparently no questions. So if you now were to do the target walker example, you can not only solve it in the way that we described, but you can also use a neural network, and if you use a neural network, maybe you can then also change the state space, if you like, in order to explore a little bit more. Okay, and so let me wrap up this policy gradient setting by talking a little bit about AlphaGo. As I said in the very beginning, in the introduction, Go is considered a very complex board game, simply because of the sheer number of possibilities to play, and so it was unclear how to really solve it competitively with a computer. What DeepMind did in this very first AlphaGo paper on the subject (later there came others) was the following. They actually started with supervised learning: they looked at a big database of games played by expert players, and they tried to set up a neural network that would be able to mimic these expert players. Looking at a given state, the image of the board, it was known which move the expert player would play for this particular state, because it's a state in one of those games in the database. And so what they did is apply exactly what I just explained for reinforcement learning, only now in a supervised learning fashion. They had a neural network that was already set up to give you a policy, to give a probability of a certain action, that is a move, given the observed state. But instead of using reinforcement learning on many randomly sampled trajectories, they would just use it to increase the likelihood that the neural network would propose the action that the expert player had made in this particular observed state, which you can do in exactly this way. So this again looks like reinforcement learning, or you could say this is just the categorical cross-entropy cost function, used to mimic expert players. So this was the first step; later they actually got rid of this step completely, and I will comment on that in a moment, but at first they started like this. And then here I just want to highlight that yes, they were using the policy gradient reinforcement learning; this is in the paper, and this was the first time that I also read about this formula back then. In this case they call the reward z, and this is just the reward for winning or losing a game, so plus one or minus one in the end. And here you see that you try to change, in this case theta is called rho, so sorry for the change in notation, you try to increase the log likelihood of the probability for taking actions, weighted according to the final reward. So if the final reward was good and you won the game, then you will increase all the probabilities; if the final reward was bad and you lost the game, you will suppress all the probabilities. That makes sense. And so here's a kind of visualization of the policy network that they took. Forget the middle column, that's another thing that we will discuss later today, but here's the policy network. The input is really an image of the whole board, and then it's processed in multiple steps by a convolutional neural network, because that's perfect for processing images, and in the very end the output is also an image, because it's the action probabilities on this board, so where to put the next stone.
If one of these green bars is high, that means a high probability, so it's quite likely that when you sample from this probability distribution you will put your stone there, and if you have a smaller bar then it's a smaller probability, and you just make sure that it's normalized and off you go. So it's all nicely compatible with the technology of convolutional neural networks, which of course was also nice, I should say. Okay, there's a reason why there's a so-called value network; that's a bit more advanced than the standard policy gradient that we discussed. Also, they had another kind of ingredient, which was more of a search-type thing, so it was quite a bit more advanced than what I've been discussing right now, but at the heart of it this is, so to speak, policy gradient. Okay, and so here are some results, and the results I'm showing here are already from a later paper. What you see on the vertical axis is some rating of the strength of play of this neural network; here's the training time in hours, and don't ask me how big their cluster was. So what they are showing here is several things. The purple curve would be purely supervised learning on expert players. Then the dashed line was already a little bit better: this was the first paper, where they started with supervised learning but then also introduced reinforcement learning. And I should say something about the reinforcement learning. How do you do reinforcement learning for a game? When it's a video game you can just run the video game many times, but when it's a board game it seems like you would have to play against human players many times, and this is completely out of the question for the number of games they played here; how would you get an army of super strong human players willing to play for years against your computer program? So what they did instead was something very smart: they took the current version of the program, they froze it at some point, and then they let it play against this version, so against itself or against a slightly earlier version of itself. And this is perfect, because first you get rid of the human players, but second, it's always a roughly comparable playing level, which is super important. Because if you were always to play against the best player in the world and you always lose, you don't get any reward signal, you just get minus one, minus one, minus one, minus one; you never get any gradient, you will never change anything. And that's of course also true for humans: I don't like to play against a really strong chess player, because you don't get any reward signal. Whereas if you always play against someone who's about as strong as you are, then you sometimes win and you sometimes lose, and then you really get a reward signal and you can improve. And so they improved in this way, against themselves. Okay, and now the blue curve is the fantastic thing. The blue curve is their later version, which was called AlphaGo Zero, and it was called Zero because it did not start by training on expert moves; it started completely from scratch, with some random probabilities, playing against itself. And of course in the beginning it's super bad, and it doesn't improve very quickly, because it is basically just learning about the rules of the game from scratch. Eventually it becomes better, it becomes as good as even the earlier version of AlphaGo, and then you see this little jump, it really takes off and reaches a completely different level. And so that is very surprising and remarkable.
It also relates a little bit to what one does observe for humans. There are many stories, biographies of famous scientists who were, so to speak, self-taught and got to a completely different level, probably also because they first struggled very much to teach these things to themselves. So apparently this is something: if you only learn from the experts, you get stuck in a certain mode of thinking, but if you do it yourself, you can reach a new level. And so this is the quote I brought in the beginning, that it was then making moves that no human player would have dared to make. Yes, yeah, I always assumed it's the human scale, that it's normalized to a human rating scale, but I'm not an expert in Go. Okay, good point; at least for the expert game database they could probably do something, and I don't know whether that's sufficient to calibrate it, but I'm not an expert, yeah. Okay, so, any questions still about this? Yeah, legal actions that I cannot take, you mean? Ah yes, but here it's a relatively simple game, so they will probably mask this, so all the probabilities for illegal placements would be zero; for example, you cannot place a stone on top of another stone. But that's relatively easy: the easiest way to do it is just to multiply all these places by zero and then to renormalize the rest, and that's fine. Okay, good, so let's move on. So this was policy gradient, where the trick was that even despite having discrete actions you can turn it into a continuous problem for gradient descent by introducing probabilities. And now we introduce the next big class of reinforcement learning methods, which is called Q-learning, and there's another trick for how to go from discrete actions to continuous numbers. The trick here is to introduce a so-called quality function, that's where the name Q comes from, and the purpose of this function is to predict the future expected reward for a given state and a given selected action. So: if I'm in this state and I'm selecting this action, and afterwards play according to the usual policy that I have, and also according to the statistics of the environment, what's the expected reward that I will get in the end? And of course it's pretty clear that the best strategy, if this function is given, should be to select the action that currently has the best quality function value, because that's the best expected reward. So it goes a little bit in the direction of planning, but the question is of course: how do you even get this Q function, who will reveal this Q function to you? This is also a little bit similar to things that people have been doing when programming computers to play chess; there you also wanted to have a function that evaluates the position and says, oh, this setup of the board is very favorable for you. This kind of thinking goes in the same direction. Okay, so here's our little robot again, with boxes that it wants to pick up, and now we can introduce several things. If we look at each location as a state, we could define something like a value; this is not yet the Q function, but we could say: if I'm in this state and I continue playing the game, what do I expect as a reward, typically?
And obviously, if you start out relatively close to one of the boxes, then even with a random walk policy you would be more likely to get a good reward, in comparison to starting out very far away from the boxes. So it's not unreasonable to assume that this value function, which I will define more clearly in a moment, would be bigger near the boxes and smaller away from them. And once you have such a value function, this alone would already be pretty good for determining a policy: if you are at some spot, you could look around and see where the value function increases. But this Q function goes one step further; it depends not only on the state but also on the action that you propose to choose. So here I'm plotting the quality of the action "going up" in dependence on the state, which is the location. And obviously, if you're just below a box and you decide to go up, that's good, because you will get a good reward even in the next time step already. If you then say you run for ten time steps into the future and count the reward, you expect it to be large even if you're a little bit further away: say you're in this lower right corner and you go up, then of course you don't immediately collect the box, but you have come closer to it, and maybe in the next step, if you move to the left for example, you will pick up the box, so this is already good as well. Whereas if you are now above the box and you move up, that's probably not so smart. Okay, so that's visualizing the quality function. So let's define things. I told you in words what the quality function is, but what it really means is: the quality function for a given state and a given action is the expectation value of the future return R, given the state and given that I now take this action. And to be more precise, we assume that all the future steps just follow the current policy, so the Q function is policy dependent; if I do stupid things, the Q function has lower values, and so on. Okay, I already introduced this future return with the discounting factor, if I like; again, ideally you don't have any discounting, but sometimes things run more stably if you say it's more important for me to get an immediate reward and not so much what happens later. And now I can also introduce this value of the state, which is simply what you would get if you take the perfect action, so you could say it's the maximum over all actions of Q of S comma A, if you already adopt this Q-learning policy, which is to take the best action in terms of Q. If you have any other, arbitrary policy, you would instead average here over all the actions that your policy proposes to you, so instead of the max you would have the average. Okay, but now the big question is: how do we obtain the Q function? So how would you obtain this mysterious Q function if you don't know anything else about the problem? Maybe, yes, again you can discuss with your neighbor how you would obtain the Q function; maybe there are some brute-force ways of doing it. Yes, and that's still a question, yes: so that's the return for the future, so from the current time step onwards I'm summing up all the rewards that I will still get, because the rewards that happened previously I cannot influence any more anyway. Yeah, so if gamma is close to zero I'm very greedy, then I say only the immediate reward right now counts and the rest doesn't count, and then of course it's super simple, I can immediately write down what the Q function is. Yes, exactly, and the closer gamma is to one, the closer I am to what I really want to optimize, and then it becomes difficult.
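Written out as formulas, the definitions just given read (with r_t the reward at step t and gamma the discount factor):

```latex
Q^{\pi}(s_t, a_t) \;=\; \mathbb{E}\!\left[\, R_t \,\middle|\, s_t, a_t \right],
\qquad
R_t \;=\; \sum_{k \ge 0} \gamma^{k}\, r_{t+k}, \qquad 0 < \gamma \le 1,
```

```latex
V(s) \;=\; \max_{a} Q(s,a) \quad \text{(greedy Q-learning policy)},
\qquad
V^{\pi}(s) \;=\; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[\, Q^{\pi}(s,a) \right] \quad \text{(arbitrary policy } \pi\text{)}.
```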
Okay, so please discuss a little bit, at least to start thinking about the Q function. Okay, maybe you could still discuss a bit more, but in a minute or so I want some suggestions. Okay, so does anyone want to offer an opinion on how we can get Q, maybe some simple method first, what would be the simplest possible way to try to get Q? Okay, so let me see if I understand: you want some scheme where you initialize the Q value to almost zero everywhere, but then you observe, like in the picture that we showed, that obviously when you're very near the target it's relatively easy to say what the Q value is, because you will be very quick to reach the target, and somehow you go on from there. That's how I understand you, and yeah, that's roughly the thing that will happen. We could also have done something much simpler: we could have literally taken the definition that I gave and said, for each state and action I just run many trajectories according to the policy and take the expectation value, and that's the Q value there, and then I go to the next state and run many trajectories. This works. The reason why it is a little bit wasteful is the following: imagine you have two states very close by; you run many trajectories here, you get the expectation value of the return, you run many trajectories there, you get the expectation value of the return. But if these are close by, and I start here and then I move to the right, so I'm in the second state, it seems like the Q function that I have already calculated there should tell me something, that I should be able to reuse this information somehow, instead of doing independent Monte Carlo for each of the state-action pairs. And so that's the idea of Q-learning, which will work by a kind of recursive mechanism and then implements, in mathematical detail, what you intuitively understood from the picture. The equation we start from is one of those equations that seems true but a little bit pointless; it's called the Bellman equation, it has been around for a long time, from optimal control theory, and it works like this. It's a kind of recursive equation: the Q value of state S and action A is the expectation value of the future return, which we know by definition. So what is the future return?
Well, it's first the reward that I immediately get, which I can calculate easily, because given the state and given the action that I take, I get the reward; plus all the remaining rewards. Now, when we have discounting, they will be suppressed anyway by a factor of gamma, but that's a detail. But what are the remaining rewards, or rather, what will be the average of the remaining sum of rewards? Well, that's again the Q function, but now from the new state that I have reached. So if I started in state S with action A, I will reach a state S at time t plus one, and in this state again I have a Q function, and again, if my policy is the Q-learning policy, I will select the best action from that point onwards, and that will give me the expected return. And so I have been able to write down an equation that says: the Q function of S and A is the expectation value of an immediate reward, plus the rest, which is again expressed in terms of a Q function. And it's exactly the situation that I told you about: if I have two states nearby, and I know that in the first step I go to the other state, I should be able to use the Q function of that state. This is how it goes. It's also important that here at least some information about the reward is injected, otherwise this whole formula would be very empty. But the problem is of course that Q appears on the left and on the right hand side, so it's one of those equations that I still have to solve. Okay, so is this clear? So the gamma: forget the gamma, the gamma comes in only because of discounting. What is written here is really the first-step reward plus, and then there should be the sum over the rest, but the sum over the rest is again a Q function of the new state, and for this particular policy it is given by the maximum over all possible actions, because that's by definition how the policy works; if I had an arbitrary policy, here there would be an expectation value, according to the policy, over the different actions that it can take. Good. So now, how do we solve this? Well, the simplest thing you can try when you have such an equation is some kind of fixed-point iteration. So, if you say X equals cosine of X, then you know that on your pocket calculator you can just press cosine, cosine, cosine many times, and eventually you will converge. And we can try to do something similar here. We just say: we have a guess for the Q function, we insert that on the right hand side and use it to get a new guess for the Q function, and then again I insert that into the right hand side, and so on and so on, and if I do it in the correct way, hopefully I converge. Now, this may be a bit brutal, so instead of just inserting on the right hand side and then immediately updating completely, maybe I just insert on the right hand side and move a little bit in the direction of the result, so I take smaller steps; that's a bit more stable. So what I can do is this iteration, where the new Q function is the old Q function plus some small number alpha times whatever is on the right hand side minus the old Q function. That's just taking a little step in this direction instead of doing a full update step; if alpha were equal to one, I would be doing a full update step, because then this Q old and that Q old would completely cancel. Okay, so that's a fixed-point iteration, and now we're in good shape; what will happen now, as I'll show in a moment, is graphically exactly what you suggested.
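As a minimal illustration, here is what this fixed-point iteration could look like in tabular form, on a toy one-dimensional chain that is purely an assumption of this sketch (in practice the environment, the sweep over states, and the handling of terminal states would all look different):

```python
import numpy as np

# Fixed-point iteration for the Bellman equation,
#   Q(s,a) <- Q(s,a) + alpha * ( r + gamma * max_a' Q(s',a') - Q(s,a) ),
# on a toy 1D chain: states 0..N-1, actions 0 = left, 1 = right,
# and a reward of +1 for stepping onto the target state.
N, target = 20, 15
gamma, alpha, num_sweeps = 0.9, 0.5, 200

def step(s, a):                       # deterministic toy dynamics of this chain
    s_next = max(s - 1, 0) if a == 0 else min(s + 1, N - 1)
    r = 1.0 if s_next == target else 0.0
    return s_next, r

Q = np.zeros((N, 2))                  # initial guess for Q(s, a)
for _ in range(num_sweeps):
    for s in range(N):
        for a in range(2):
            s_next, r = step(s, a)
            bellman_rhs = r + gamma * np.max(Q[s_next])   # right-hand side of the Bellman equation
            Q[s, a] += alpha * (bellman_rhs - Q[s, a])    # move a small step toward the new guess

# After convergence, the greedy policy argmax_a Q(s, a) walks toward the target,
# and the values spread outward from the target, sweep by sweep.
```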
And of course, as usual, we can use a neural network to approximate Q, even if the state space is large. The advantage of a neural network is also that it learns to interpolate: maybe I cannot visit all the possible states, because there are too many of them, but the neural network will learn to interpolate, so what it has learned on this set of states it can also extrapolate to other states. Okay, and so here's an example. Again, here's my target, where I get a good reward, and here's the Q function after the first step of this update rule that is given by the Bellman equation. So I have already learned that if I'm in this spot and I'm going up, then I will get a good reward, so here the Q function has some value; the rest is still initialized to zero. But now I apply this Bellman equation again, and now I will learn: okay, if I'm here, I will not get an immediate reward even if I'm moving up, but it is still good, because over there there is a nonzero Q function. So it's a kind of two-step process. And here I also got something, because if I'm moving up and then moving left, and I do not show this part of the Q function here because I can only visualize one of the four actions, then I also got a good reward, so again, by the update rule, this also gets updated to a good value, and so on and so on. So it spreads; it's like an infection that spreads from the center, and the Q function will be updated everywhere, and eventually it will converge to something. Are there any questions about this? It's a very nice way, and it's somehow kind of easy to understand. Imagine a labyrinth and some spot that I want to reach: if I'm very close by, it's obvious where I should go; then, if I'm two steps removed, at least in the next iteration it's obvious where I should go, and so on and so on, and so this will spread throughout the labyrinth. There's still another thing: initially the Q function is of course arbitrary, maybe it's all zero or it's randomly initialized, and then I may have a problem, because if my policy is really to select the action with the best Q, I may select very strange trajectories at first; I may be stuck in the wrong actions, that's what I want to say. And so what people do is: with a certain probability epsilon they just select a random action, so that you can explore different possibilities, and that's called exploration. And whenever you instead really follow the Q function, so the max of the Q function, that's called exploitation. And later you can reduce the randomness, when you're close to converging. And so here I will give you an example, a relatively famous example, again from the same company; this was the big example that put them on the scene, and they were bought up by Google for an insane amount of money on the strength of this example. What they did was combine deep neural networks for image recognition, which had just made their big splash in 2012, with reinforcement learning, and they did that for video games. These are the old-fashioned games from the 80s, and these days you can run them in an emulator, so you don't need the actual hardware. And the thing was: the neural network learns to play these games purely by observing the video screen, that's the state input, then outputting actions, which is typically just moving up, down, left, right, and getting as reward the high score that these games give it. And at least in many of these action games it became really good, and there were then other notable examples where it didn't become so good, where you require more long-term thinking.
But by now these kinds of reinforcement learning approaches are also good at those. And so that's really Q-learning: you would take the image as an input, have a few convolutional layers, then maybe some fully connected layers, and eventually, for the different possible actions, you get the Q function as the output. So these are not action probabilities like in policy gradient; this really is Q of S, A, with S being the image and A being one of these discrete actions. That's the output of this neural network, and then you select the action with the biggest value. They then went on to visualize the knowledge gained by the neural network. So this is, as a color scale, the Q function for different states, or maybe it's the value function, because it's only the states, I don't remember. And so, for example, here you see, if you're looking at a screen with many, so this is one of those Space Invaders games where you have to shoot down the alien spaceships, and if you're looking at a screen which still has many alien spaceships, then the value of this will be large, because you expect: oh, now I'm going to get a high reward because I'm shooting down all these alien spaceships. So this is visualized in terms of color. And then what they also did, have you talked about t-SNE, this visualization technique? It's one of those techniques where you take a point cloud in a very high dimensional space and project it down to two dimensions, and people do this for example if they're looking at the activation vectors inside a neural network, which are very high dimensional vectors. So they run the possible inputs through the neural network, look at the vectors that come out, get a point cloud in a high dimensional space, and they want to project it down to two dimensions to visualize something. And they did that here for the last hidden layer of their Q network, and so they could see that, say, similar looking images were grouped together by this neural network; these are images where, I don't know, only half the alien spaceships in the upper right corner are still present, and they form some part of this space. Okay, any questions still here? Ah, no, but at least the policy: do we agree that once the Q function is perfect and it gives me the expected return, I should pick the action that, well, maximizes this expected return, right? I will not just take this or that action, right? And so the expected return given the current state, for the perfect policy, is just defined by the maximum over A, and that's why also in this Bellman equation I put the maximum over A, because I want to follow the perfect policy. If I still had a stochastic policy, maybe a policy that doesn't come from Q-learning but from something else like policy gradient, I would take the expectation value, with respect to the policy, of Q of S comma A; but for this particular policy it's just the maximum.
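A sketch of what such a Q-network and the epsilon-greedy action selection from before might look like; the layer sizes, the frame-stack input shape, and the epsilon value are placeholder assumptions, not the actual architecture from the paper.

```python
import numpy as np
import tensorflow as tf

# Q-network of the kind described: the screen image goes in, and the outputs
# are Q(s, a) for each discrete action (not probabilities, so no softmax).
num_actions = 4

q_net = tf.keras.Sequential([
    tf.keras.Input(shape=(84, 84, 4)),                      # stack of recent frames (assumed shape)
    tf.keras.layers.Conv2D(32, 8, strides=4, activation="relu"),
    tf.keras.layers.Conv2D(64, 4, strides=2, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(num_actions),                     # linear outputs: Q(s, a) per action
])

def select_action(state, epsilon=0.1):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit by
    taking the action with the largest predicted Q value. `state` is one frame stack."""
    if np.random.random() < epsilon:
        return np.random.randint(num_actions)
    q_values = q_net(state[None, ...]).numpy()[0]            # add a batch dimension
    return int(np.argmax(q_values))
```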
Okay, good. So then I want to say a few words about what people are using nowadays, and what they are using nowadays is really a combination of both of these approaches, the policy gradient and the value-function-based approach, and it's called actor-critic. It's called actor-critic simply because you think of the policy of the agent as describing what the actor, the agent, should do, and then on the other hand you still keep a value function, that's the critic, which says, oh, here you're doing well and here you're doing not so well. So that's the critic, the value function. Okay, so: policy gradient on one hand, Q-learning and value-based approaches on the other hand, and then, in the intersection, the modern actor-critic approaches. So what's the idea? Remember, at some point I told you that there's this concept of the baseline, and just to recall: you have the reinforcement learning update that, in policy gradient, tells you to go along the direction defined by R times the grad of the log of the action probabilities. But you can play around with the return; you can for example shift the return by a constant, and you will still get the exact same update on average, however the variance can change, and that's important. So if you have an update rule in machine learning in general, and you can somehow reduce the variance and still keep the right average direction, that's obviously good. And so the idea behind these actor-critic approaches is to use a baseline that now is not merely a constant, but can depend on the state, and gives the expected return from that state. And what's the expected return from that state? Well, that's the value function that we introduced before. So the value function, what is it again? If I want to re-express it in terms of the Q function, it is the expectation of the Q function of S comma A, given that I'm in a certain state S, where I take this expectation over the policy that selects my actions; it could be any policy, it need not be the Q-learning policy. And that's the value of the state S: the expected return if I'm in this state and then follow my current policy. And so if I introduce such a baseline, if I'm comparing my actual return against this expected return, I'm really answering the question: okay, I was in this state, how much better did I end up in the end than what I should have expected on average? That's an important question, because it's not a big deal if you are in a good state and you're doing well in the end; that's not surprising, right? If I give you a million dollars to start with and you end up with 900,000 dollars, is that good or bad? It's actually bad, because you lost money, even though in comparison to many other people you're financially still doing very well. So it's really how you moved compared to the expectation that should be counted. Okay, and so what we introduce here is the so-called advantage, the advantage over the average. The advantage really asks: if I'm in a certain state and do a certain action, how do I compare against the average result of starting from that state, so against the value function? It is Q of S, A, which is action dependent, minus V of S, which averages over all the actions according to the current policy. How much better does the action A perform than what one expects on average from the state? And so, just to go through the math a little bit, the first little rule that I want to use is written down here. If you give me two functions, one of which depends only on X, and the other depends on X and another random quantity Y, then the expectation value of the product can be rewritten in this fashion: I take the expectation value of this F of X, Y with respect to Y, given a fixed X, then I multiply by G of X and still average over X. So it is, so to speak, averaging in two steps, okay?
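In formulas, the quantities and the little rule just described read:

```latex
V^{\pi}(s) \;=\; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[\, Q^{\pi}(s,a) \right],
\qquad
A(s,a) \;=\; Q^{\pi}(s,a) \;-\; V^{\pi}(s),
```

```latex
\mathbb{E}_{x,y}\!\left[\, F(x,y)\, G(x) \right]
\;=\;
\mathbb{E}_{x}\!\Big[\, \mathbb{E}_{y}\!\left[\, F(x,y) \,\middle|\, x \right]\, G(x) \Big].
```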
And so if I use this, I can now say the following. Look at this: this is the reinforcement learning update, this is the R times the grad of the log of the action probabilities, and now I claim it's the same if I replace the R with the Q function. This is true because the Q function is just the expectation value of this return, and if you think a little bit about it, what I'm using is just the formula up here: I'm replacing the R, which is still fluctuating, with the Q, which is in principle its expectation value, but conditioned on the state and the action, and the state and the action are the only things that appear here on the right hand side, so this works out. So that's the first step: I have replaced my usual reinforcement learning policy gradient update with something that includes the Q function, and that's an exact step. And now I bring in again this idea of introducing the baseline to reduce the variance. So instead of using the Q function at this spot, I'm using the difference between the Q function and the average of the Q function over the different actions that I could take, so the value. And again, this is not necessarily the optimal baseline, but to compute the optimal baseline would be much more complicated, and this is a good enough approximation. Good. So now this Q minus V, by definition, is called the advantage. We still want to approximate even this, and so I can write the advantage as follows: the minus V is just the minus V from before, but the rest here is an approximation for the Q function, so to speak the immediate reward that I get plus the discounting factor times the value of the next state. Okay, so this is the advantage formula that I will use so as to get rid of the Q function; now there is no Q function appearing anymore. And the V I will actually learn from a Bellman-type equation: one can write a Bellman equation not only for the Q function but also for the V function, it looks almost the same, I've written it down here, and then the iterative fixed-point update rule that I discussed for the Q function one can apply to the V function as well, again with a small update step alpha. And so what people then do, to put everything together, is they use a neural network for approximating the V function; it's a neural network where you put in the state and out comes a value, which is the value of that state, and then they use this update rule, whenever they go through a trajectory, to become a little bit better at approximating the value function. So now, overall, we have two things: we are still using a policy, and we are updating the policy according to this formula, but using a nice, good baseline, so effectively using the advantage; and we are also using a value network, in order to be able to calculate this advantage. And that is, finally, how these actor-critic methods work. Let me see, so this is the last slide of this technical part, and then I have to see how I proceed. So what do people use nowadays? If you don't know anything else, what I propose is a technique called PPO, Proximal Policy Optimization, which is one of the many advantage actor-critic techniques. It's very stable, very robust, you can use it not only for discrete actions but also for continuous actions, and you don't need to program it yourself. Just as with the rest of machine learning, by now there are multiple frameworks out there, and I list a few of them: Stable Baselines, which started out for TensorFlow and nowadays exists as Stable-Baselines3 for PyTorch; then there's TensorFlow Agents, obviously for TensorFlow; there's PureJaxRL for JAX; and there are many others.
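Just to illustrate how little code this takes with such a framework, here is a sketch using Stable-Baselines3; the CartPole environment is only a stand-in, in practice you would wrap your own experiment or simulation as a Gym-style environment.

```python
# Sketch using Stable-Baselines3 (PyTorch-based); "CartPole-v1" is just an example
# environment, you would normally plug in your own Gym-style environment instead.
from stable_baselines3 import PPO

model = PPO("MlpPolicy", "CartPole-v1", verbose=1)   # PPO agent with a simple MLP actor-critic policy
model.learn(total_timesteps=100_000)                 # train by interacting with the environment

obs = model.env.reset()                              # afterwards, use the trained policy
action, _state = model.predict(obs, deterministic=True)
```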
So you can really pick whatever you want, and the nice thing about these frameworks for RL is that you can relatively easily exchange one approach for another. I say relatively easily because different approaches have different hyperparameters and so on, so whenever you do this you still need to at least look at the description and make a few choices, but it's relatively easy. Okay, so let me see: we still have 20 minutes or something like this, and then there would be the break, and then the next session. Okay, so first I pause for questions, and then I can still tell you something about continuous actions. So, are there questions at this point? Yes, yeah, so, the baseline, it's called baseline, really, ah yes, absolutely. Yes, so just to remind everyone: this plot dramatically teaches you that if you don't do any of these baseline things, you get the green line for the variance, which is really bad, it has high fluctuations and in this case also scales badly with the number of steps. Yeah, here this is done with a so-called optimal baseline, and in this particular example I'm even showing how it would be calculated, but it's a little bit harder to calculate and people don't actually use it; this advantage actor-critic thing is already an approximation to this optimal baseline. Other questions here? So then let me switch to the blackboard and still tell you something about continuous actions, which is also very important. So imagine you're doing some physics application; then it's relatively likely that you will want to steer your physics experiment with continuous controls. So instead of the actions being up, down, left or right, it's much more likely that you want to send a microwave pulse down to your quantum experiment, and you want to choose the amplitude and maybe the frequency of this pulse. These are obviously continuous parameters, so the action is really something continuous, and now: how to do this? One possibility would be to discretize this continuous range of actions, and then you are back to the case of discrete actions. The problem with that is, well, the first problem is that the finer you discretize, the more of these actions you have, so the more output neurons you will need. But the bigger problem is if this is an action in a higher dimensional space, so let's say a two dimensional space: you could still discretize it, but the number of little pixels in your discretization now goes quadratically with the number of discretization bins per dimension, so like n squared, and this really becomes painful. So that's not a good way to proceed. So what do people do when they have continuous actions? Here's the idea: instead of outputting action probabilities on a discretized grid, you say my action probabilities are really like a Gaussian, so this axis is A and this is the probability distribution over actions that my neural network produces, so to speak. But how do I do that with a neural network? You can ask the neural network to predict the center of the Gaussian, and maybe also the spread of the Gaussian. So what will happen is: you have your state as input, you have all your hidden layers, and then, in this one dimensional case, you have only two output neurons, one produces mu, the other produces sigma, and afterwards you just apply the usual reinforcement learning policy gradient update rule. So you can calculate by hand the grad log of this Gaussian distribution, written in terms of mu and sigma.
But these mu and sigma are themselves outputs of the neural network, so I can write them as something like mu theta of s and sigma theta of s. The log of a normal distribution is easy to calculate, and in the next step you apply the gradient to the mu and the sigma, and that's done by the automatic differentiation in your neural network framework. So this is the way people introduce continuous actions, and it scales very nicely if you have a high dimensional action space, because you then just have more mus and more sigmas, so no problem with that. And now you can think about what really happens here. You could have thought: why not just have the neural network predict a single output A? That's another choice you could make, where the neural network is asked to produce one deterministic continuous output A. But that's not good, because it departs from the original concept of having probabilities as outputs, and you can also see why it's not good: then you would never explore what would have happened at other values of A. The point of having this probability distribution, in this case the Gaussian, is that you always want to wiggle a little bit. Sometimes when you sample from this Gaussian you will take this value, sometimes you take that value, and so on, so you wiggle a little bit around the possible continuous values of A; sometimes you get a higher reward, sometimes you get a lower reward, and this is exactly what gives you the real signal that teaches you, in the end: whenever I randomly go to a slightly smaller value of A, I actually get a higher reward, so overall I should probably be moving my mu to the left. And so this is the thing that people do for continuous actions. Okay, any questions on this? Yeah, okay, very good question. What people do in practice is typically they only use the diagonal elements of this covariance matrix, so these individual sigmas; I have not really seen applications where they use the full covariance matrix. But maybe you have some physics insight into your particular application, and then maybe you know that there is a good reason for making the action components dependent on each other. It's not that important, though, because in the end, in most cases that one can think of, your policy will eventually still converge to a deterministic policy, meaning that the sigmas will actually shrink. It's just important during the training that you have this little wiggle room, but it's not so important that these components are wiggling in a correlated way, so to speak; it's just that each of them has a little bit of wiggle room. That's another possibility, and some people also do this: maybe the sigma is not something that the network predicts, but something that you choose, as you say, during the course of the training. Here it's maybe a little bit nicer if the network predicts it, because it can somehow adapt itself; you maybe don't know what would have been a good choice of sigma in a certain situation, whereas the network can start with a very broad, very large choice of sigma initially, and then maybe quickly realize: oh no, I know pretty well which deterministic value of A I need, and reduce its sigma very quickly. So that's the only difference, but yes, one could do what you say.
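Here is a minimal sketch of such a Gaussian policy for a one-dimensional continuous action, with the network predicting mu and log sigma and the grad-log handled by automatic differentiation; the network size, learning rate, and the log-sigma parametrization are assumptions of this sketch.

```python
import tensorflow as tf

# Gaussian policy for one continuous action: the network predicts mu(s) and log sigma(s),
# actions are sampled as a ~ N(mu, sigma^2), and the policy gradient update pushes up
# log N(a | mu(s), sigma(s)) weighted by the return R.
state_size = 3                                     # placeholder size
net = tf.keras.Sequential([
    tf.keras.Input(shape=(state_size,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(2),                      # outputs: [mu, log sigma] for the given state
])
optimizer = tf.keras.optimizers.Adam(1e-3)

def update(states, actions, returns):
    """One policy-gradient step; states (B, state_size), actions (B,), returns (B,)."""
    with tf.GradientTape() as tape:
        out = net(states)
        mu, log_sigma = out[:, 0], out[:, 1]
        sigma = tf.exp(log_sigma)
        log_prob = -0.5 * ((actions - mu) / sigma) ** 2 - log_sigma   # log Gaussian, up to a constant
        loss = -tf.reduce_mean(returns * log_prob)                    # minimize -R * log pi
    grads = tape.gradient(loss, net.trainable_variables)
    optimizer.apply_gradients(zip(grads, net.trainable_variables))
```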
So, since I don't believe I have time anymore to go into a Jupyter notebook, because that would be quite another thing, I would say we stop at this point and take a slightly longer break, and then we come back and I show you some physics examples. Okay, thank you.