Okay, thanks for the invitation. Let me start with a disclaimer: I'm not an expert, which is why this is a tutorial talk; otherwise I would have pitched it at a more expert level. Just a show of hands to gauge the audience: how familiar are people here with reinforcement learning? Does anyone know what a Markov decision process is? What a policy gradient is? I've tried to tune the talk to that level, so I'll keep it fairly basic. This is a weekend morning, so I didn't want to go too deep into the details, but it will still be fairly math-heavy, so bear with me.

The title splits into two parts: there is reinforcement learning, and then there is deep reinforcement learning. The first half of the talk, around twenty minutes, will just set up the notation of reinforcement learning, and towards the end we will shift gears to deep reinforcement learning, where we'll focus mainly on one algorithm, the policy gradient, and try to derive it. I don't intend this tutorial to be exhaustive; it's meant to give you some pointers to get started, from which you can read further.

Essentially, the motivation for deep reinforcement learning comes from this observation: as humans, we learn by interacting with the environment. Think of an infant trying to walk, or of learning to balance a bicycle. No one explicitly teaches you with labeled examples: "you're sitting in the right position, so you get label one; you're sitting in the wrong position, you get label zero", and so on. Instead, you dive into the environment. You try to walk, you fall down, you get hurt; that's a reward of minus one. Or you get up, momentarily find your balance, and get a reward of plus one. Slowly you get a sense of what it means to walk, or what it means to ride a bike, just by trying things out. That's how humans learn: there is no explicit teacher handing out one-zero-one-zero labels as in the conventional machine learning framework.

A couple of things to note here. First, as humans we are acutely aware of how our environment responds to what we do: we act in an environment, we do something, and we get feedback. Second, and most importantly, we seek to influence what happens through our behavior: once I get a reward, I adjust how I act so that I can collect better rewards. Reinforcement learning is just a computational approach that formalizes this notion; we call it learning from interaction. In contrast, we have supervised learning, where you learn from labels, and unsupervised learning, with no labels; this third paradigm is learning from interaction with the environment. Technically, a reinforcement learning agent learns from interaction with the environment to achieve a desired goal. That one statement is the summary; the rest of the slides are essentially about how to mathematically concretize this statement.
What is the mathematical formalism, in other words, for "learn from interaction with the environment to achieve a desired goal"? I'm not going to go through many applications, but there are quite a lot. I started with the way humans learn as the motivation, but that's not really what we're after; we're interested in a wide variety of problems. The classical example is learning to play games; there are control problems in robotics; even recommender systems can be cast in a reinforcement learning framework. At a very high level, even some standard supervised learning problems, say ones with certain non-differentiable loss functions, can be cast as reinforcement learning problems. So it's not a new paradigm, but it is a very powerful one if you can make it work, because almost any problem you can think of can be cast as a reinforcement learning problem. That's what I want to get across today: I'll give you the basic formalism, you can think about your own setup and how to formulate it in this framework, and then there are standard algorithms with which you can crank the machinery.

I'll briefly show a couple of examples before we get into the details. In 2013, I believe, DeepMind showed that you can learn to play Atari games just from raw pixel values. The examples I'm showing are success stories of deep reinforcement learning rather than reinforcement learning in general: reinforcement learning has been around for a long time, but it was not very successful until deep learning came along, mainly because enormous state spaces caused serious computational issues. Since 2012 or 2013, though, there has been tremendous progress in deep reinforcement learning. Let me also make a comment about how I've arranged these two talks: deep reinforcement learning, and after this, the second talk on GANs, are probably the two hottest areas in deep learning right now, pretty much the cutting edge of the field.

The point I want to make about the Atari result is that all the system needs is the pixel values of the game, and it was able to figure out how to play. Then there is Google's AlphaGo, which defeated the world Go champion, I think in 2017; this was again based purely on a deep reinforcement learning framework. There are also many examples in robotics: how do you make a creature jump across a terrain, how do you control it. At Berkeley there is something called a preschool for robots: just as kids go to preschool, robots go to a preschool where they keep trying things, and the whole closed loop is a reinforcement learning loop. They keep trying: how do I lift this and put it there, how do I grasp something? Essentially they just sit there trying and doing that for days on end.
We'll come to the formalism later, but here is one more example to keep in mind: a recommender system, which is exactly what some ongoing work in our lab tries to cast as a deep reinforcement learning problem. Say you're looking to buy a dress: the system shows you some items from the catalog; the user clicks on a blue dress, so it shows more blue dresses; the user then says, show me only sleeveless, so it picks out sleeveless dresses, and so on. Again you can see the interaction: you're interacting with the catalog, taking actions, the system shows you results, and you get a reward at the end. What is the reward? Someone adds an item to the cart. How exactly to formulate this we'll come to, but this sort of multimodal conversational recommendation can itself be cast as a deep reinforcement learning problem.

I'll also point you to the OpenAI Gym site, which is where a lot of the examples I'll talk about come from. It's a collection of environments, many of them video games; think of it as a testbed for your deep reinforcement learning algorithms. You get a game where some AI agent controls one side, and your reinforcement learning algorithm tries to win. There are a lot of games there.

So that's the introduction; now let's do a deep dive into the reinforcement learning framework. I'll come back to the formalism: an agent learning from interaction with the environment to achieve a goal. This is what we want to formalize: what is an agent, what is an environment, what do we mean by interaction, what is a goal, and how do we achieve it? As I said earlier, there is supervised learning, where you're given explicit labeled examples, and there is unsupervised learning; the third paradigm is reinforcement learning. In many problems it is simply impractical to obtain labeled examples that cover all possible scenarios, which is why it is so natural to formulate such problems in the reinforcement learning framework.

Before we get to deep reinforcement learning, we need to speak the jargon of reinforcement learning. If all of these terms are already familiar, please raise your hands and I'll skip ahead, but this is exactly what we're going to cover: agent, environment, state, action, reward, policy, return, value function, model. For any problem you can think of, if you can map its pieces onto these concepts, then you know how to cast that problem in the RL framework. We'll go through them one by one; that's the main part of the talk.

My references in preparing this were two. One is Richard Sutton's book on reinforcement learning; the book is quite old now, and he is writing a second edition that is fully available online. If you want to go deep, chapters one to three will give you a good sense of reinforcement learning, and you can actually stop after that, because everything beyond is classical reinforcement learning. For deep reinforcement learning there are no textbooks out there yet, but the running example I'm using to explain the concepts comes from Karpathy's blog post on deep reinforcement learning.
Okay, so let's dig in now. We are trying to formulate this agent-environment interaction, so let me reiterate: an RL agent must be able to do three things. It must be able to sense the state of the environment; it must be able to take actions that affect that state; and it must have a goal relating to the state of the environment. The diagram on this slide is one you should try to memorize, because we'll keep coming back to it. There are two entities here: the agent and the environment. The learner and decision maker is called the agent, and the thing it interacts with is called the environment. The agent is the one trying to learn, the one taking the actions; the environment is everything outside the agent. The boundary between them depends on the problem; the agent and the environment could even be parts of the same person or the same robot. Wherever you demarcate the thing that takes the actions, everything outside it is the environment.

So the agent selects an action to perform; it could be "try to get up" or "try to walk". The environment then gives out a reward for that action, and the state gets modified. The three things to remember are action, reward, and state: the agent takes an action, the environment gives a reward based on it, the state changes, then the agent takes the next action, and this closed control loop keeps continuing. That's the basic formalism of an RL problem.

We're going to use one example throughout: one of the games in OpenAI Gym, called Pong. Let's spend some time on what we're trying to do here. There is a video, but it doesn't play, so let me describe it directly. Think of this as classical ping pong, or table tennis. There are two paddles: one paddle is controlled by an AI agent built into the game, and you play the other paddle. You can move your paddle up and down, and there is the ball: you hit the ball with your paddle, it travels across, the other side hits it back, and so you're essentially playing ping pong on a grid of roughly 210 × 160 pixels. Every time you hit the ball such that the AI opponent cannot intercept it and it goes past the boundary, you win the point; conversely, if the opponent hits it past you and you cannot intercept, you lose the point. Think of getting a reward of plus one when you win a point and minus one when you lose one; we'll pin down the exact reward values shortly.
There's a counter that tracks how many points you've won, and that's how the game proceeds. Classically, then, this is just a table tennis game: you're hitting the ball back and forth, and the only goal is to figure out how you should hit. The green paddle you see is the RL agent. I'm going to play this game many, many times, hundreds of thousands of times, and figure out how to play it perfectly, without any supervision. No one is going to tell me how to play; you have the game engine as your opponent, and somehow, by playing around, you're going to figure out how to win. So the setup is clear: the agent plays one of the paddles, the other paddle is controlled by the game, and your goal is to learn how to play.

Let me briefly mention one point: if you want to learn in this setting, the built-in AI engine, the orange paddle, cannot be a perfect player. If it were perfect, you would never win, and you would never get any feedback about what kinds of actions lead to winning. Typically the built-in opponent is deliberately imperfect, so you have some chance of winning: you play, you win some, you lose some, and eventually you learn a policy, a strategy, for beating the opponent. Against a perfect opponent you would probably never get a chance to learn anything unless it explicitly taught you.

In this context, then: what is the agent? The green paddle. What is the environment? Everything other than the green paddle, the entire grid of the game. What is an action? I can move the paddle either up or down to intercept the ball. What is the reward? Either I win the point, or I lose it, or nothing happens because the game is just in some intermediate state. And what is the next state? The state is the whole image: you smash the ball, it travels around, you get some reward, and the state changes, which is essentially the new position of everything on screen. That's how you map this problem onto the framework.

Now we'll introduce a little more notation. We call the action A_t, the state S_t, and the reward R_t. A game is then a sequence: you start in state S_0, take an action A_0, get a reward R_1, the state changes to S_1, you take action A_1, get reward R_2, and you keep continuing until you hit a terminal state. What is the terminal state? You either win or lose; at that point the game stops, restarts, and you continue. That's how you play the game, and this sequence is the mathematical abstraction of that game play. To read it once more: the agent receives a representation of the environment, which we call the state S_t, where S_t belongs to the set S of all possible states. On the basis of S_t, the agent selects an action A_t from some set of possible actions.
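As an aside, this agent-environment loop is exactly the programming interface OpenAI Gym exposes. Here is a minimal sketch, assuming the classic pre-0.26 Gym API and the Atari "Pong-v0" environment; the environment name and the random placeholder policy are assumptions for illustration:

```python
import gym

env = gym.make("Pong-v0")
state = env.reset()                      # S_0: the initial raw frame
done, total_reward = False, 0.0
while not done:                          # loop until a terminal state
    action = env.action_space.sample()   # A_t: random placeholder policy
    state, reward, done, info = env.step(action)  # R_{t+1} and S_{t+1}
    total_reward += reward               # +1, -1, or 0 at each step
print("reward accumulated this episode:", total_reward)
```

Every algorithm we discuss later only changes how `action` is chosen; the loop itself stays the same.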
One time step later, as a result of the action, the agent receives a scalar reward R_{t+1}. If you look at the literature there is some confusion between the conventions R_t and R_{t+1}; I prefer this notation: in state S_t you take action A_t, one time step later you receive reward R_{t+1}, and the state of the system changes to S_{t+1}. It's still the same closed loop.

A word on how to think about actions in general: actions can be any decisions you want to learn to make. Whenever you think about a problem, ask: what is the action here? In our case it's very simple, move the paddle up or move it down. And what is a state? The state can be any abstraction that you think the action should be influenced by; it does not have to capture everything perfectly. It can simply be the set of features you believe are responsible for choosing the action; it's all in how you model your state. Finally, the agent's goal is to maximize the total amount of reward it receives over the long run; I won't formalize that statement yet, that's what we'll do over the next couple of slides.

Coming back to the Pong example: what, precisely, is the state? It is the 210 × 160 × 3 RGB image, where each pixel is a number between 0 and 255. So think of roughly 100,800 numbers as your state: a vector of about a hundred thousand dimensions, obtained by flattening the 210 × 160 × 3 image into one long vector. What is the action? It is binary: up or down; there are only two possibilities. And what is the reward? It can take three values: plus one if the ball went past the opponent, minus one if we missed the ball, and zero otherwise. Most of the time while you're playing it will be zero, because nothing decisive is happening; only at certain points do you win or lose a point, and that's when you get a nonzero reward. The whole point is to figure out how to collect those plus ones: to decide how to move the paddle up and down, based on what you see in the image, so that you accumulate lots of reward.
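To make the "state is just a flattened image" idea concrete, here is a minimal sketch; the normalization choice is an assumption, and note that Karpathy's blog additionally crops and downsamples the frame to 80 × 80 before flattening, which this naive version skips:

```python
import numpy as np

def preprocess(frame):
    """Turn a raw 210 x 160 x 3 uint8 Pong frame into the flat
    ~100,800-dimensional state vector described in the talk."""
    return frame.astype(np.float32).ravel() / 255.0  # scale pixels to [0, 1]
```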
Before I move on, let me briefly introduce a formalism. You have probably heard the term Markov decision process; a Markov decision process, or MDP, is an RL task with a special property called the Markov property. What does it say? It is essentially a way of simplifying the state dynamics. I take some action, I get a reward, and the state changes: S_{t+1} is the new state and R_{t+1} the new reward. In principle this transition could depend on the entire history of the game, everything that has happened so far. The Markov assumption states that it depends on only two things: the current state S_t and the last action A_t.

The example I always find instructive is chess, which again you can formulate in an RL framework: whatever move you make next depends only on the current state of the board. You don't have to know how you arrived at that position; there may have been many ways to reach that state, but once you're in it, that's all that matters for deciding the next action and the reward you'll receive. That is the Markov property. Most problems you'll see implicitly make this Markov assumption, and even when it is not strictly true, it usually makes sense to formulate the problem as a Markov decision process anyway.

Now to the rest of the basic RL vocabulary. We know what an agent is and what an environment is: the agent is the paddle, the environment is the entire game grid; the state is the pixel values, the action is moving the paddle up or down, and the reward is plus one, zero, or minus one. Now we come to the other four terms: policy, return, value function, and model. Once you have these four, you'll have a clean understanding of what a reinforcement learning problem is.

First, the policy. At a high level, the policy is how you choose your action: at each step, the agent has to implement a mapping from state to action. Given a state, that is, given the pixels of the current game, I want to decide what action to take: should I move up or not? The mathematical way of writing that mapping is called a policy, with the notation π(a|s): given that I am in state s, what is the probability that I take action a? Given this image, what is the probability that I move up, and what is the probability that I move down?

Most of deep reinforcement learning uses the notion of a stochastic policy. Rather than a deterministic rule that always says up or down, you compute the probability that you should move up and the probability that you should move down, and then you sample from it. If the probability of going up is 0.9 and of going down is 0.1, you toss a coin with that bias and take whichever action comes up. This looks very simple, but there's a strong reason to use stochastic policies: you want to encourage exploration. If you always take the single best action, you're not exploring at all. You want to take the best action most of the time, but sometimes take other actions, so that you can discover other possible ways of winning the game. That's why the notion of a stochastic policy becomes important; it's the probabilistic formulation.

So what is the policy in this game of Pong? It's quite simple: the probability of moving the paddle up or down, that is, the probability that y = 1 (paddle up) given the current 210 × 160 × 3 image. Essentially, I'm trying to learn a function from those hundred thousand numbers to a probability that says whether I should move up or down. That's all a policy is.
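Here's the biased-coin sampling step as a minimal sketch. The constants UP and DOWN, and the idea that some model produces `prob_up` from the state vector, are assumptions for illustration (in Gym's Atari Pong the two paddle moves happen to be action codes 2 and 3, but treat that as incidental):

```python
import numpy as np

UP, DOWN = 2, 3          # assumed action codes for Pong's two paddle moves
rng = np.random.default_rng()

def sample_action(prob_up):
    """Sample from the stochastic policy pi(a | s): up with probability
    prob_up, down otherwise. This randomness is what drives exploration."""
    return UP if rng.random() < prob_up else DOWN
```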
So we've figured out what a policy is. But now comes the question: I know the RL framework, I know what a policy is; how do I learn the policy? What formalism, what mathematical machinery, lets me learn it? For that we need to bring in the concept of a return.

There are three things to keep apart: goals, rewards, and returns, and this is what we'll walk through. At a high level, the agent's goal is to maximize the total amount of cumulative reward it receives over the long run. As you saw earlier, every time you take an action you get a reward, but we do not want to choose our policy to maximize those instantaneous rewards; they are short-sighted. What you want to maximize is the chance that you eventually win. To capture this we need the notion of a return: the accumulation of rewards you get over the long run, not just the immediate reward, because after one or two moves a small reward tells you very little.

This brings up the most crucial step in modeling an RL problem. You have some goal, say: I want to win the game. The only knob you have to make sure the agent pursues that goal is how you specify the rewards. All the creativity goes into specifying rewards such that, by maximizing the cumulative reward, the agent actually reaches your goal. If the rewards are ill-specified, the agent will still maximize them, but with no connection to the goal you had in mind. The mapping from goal to rewards is entirely on you.

Before I continue with more notation, I need to define two kinds of tasks: episodic tasks and continuing tasks. The Pong example is an episodic task: it has a natural notion of a terminal state. I start playing, I win or lose, and I stop; then I restart and play again. Each single trajectory of the game is called an episode, and you play multiple episodes. A continuing task, by contrast, has no natural stopping point: you just keep walking, say, or keep performing some robotic manipulation. We won't talk much about continuing tasks; this tutorial focuses mainly on episodic ones.

Here is what an episodic task looks like for Pong. The state is just an image, and at each step you take an action: up or down, down, down, up, up, and eventually, boom, you either win or lose. That is one episode. What I've shown here are four different episodes of the game: you start playing, take some actions, win; you start again, take actions, lose; and you keep repeating. These are called episodes of the game.
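Putting the pieces so far together, here is a hedged sketch of collecting one episode as a trajectory of (state, action, reward) triples; it reuses the hypothetical `env`, `preprocess`, and `sample_action` from the earlier sketches, plus a `prob_up_fn` placeholder for the policy model we haven't defined yet:

```python
def run_episode(env, prob_up_fn):
    """Play one episode and return its trajectory."""
    states, actions, rewards = [], [], []
    frame, done = env.reset(), False
    while not done:
        x = preprocess(frame)               # S_t as a flat vector
        a = sample_action(prob_up_fn(x))    # A_t ~ pi(. | S_t)
        frame, r, done, _ = env.step(a)     # environment responds
        states.append(x); actions.append(a); rewards.append(r)
    return states, actions, rewards         # one episode of the game
```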
Now we'll define the notion of a return for an episodic task. You have time steps: at time t you took an action A_t and got a reward R_{t+1}; from there you keep playing and receive a series of rewards R_{t+1}, R_{t+2}, R_{t+3}, and so on, until you hit the terminal state. The return is just a function of this sequence of rewards, that's it. Technically it could be any function, but the most commonly used one is simply the sum: G_t = R_{t+1} + R_{t+2} + ... + R_T. In our Pong example this will be either plus one or minus one, because the sequence looks like zero, zero, zero, ..., win or lose. In general, though, rewards can be much richer than zero-one values, and the return is just their sum.

For continuing tasks we introduce another notion, the discounted return, for a very simple reason: in a continuing task, if you add up all the rewards, the sum can go to infinity, which is not tractable. So we introduce a discount factor γ, and define G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ..., where γ is some number between zero and one that down-weights future rewards; it's called the discount rate. One way to say it: the discounted return is the present value of your future rewards; it sets how much importance you give to a reward you will receive in the future compared to a reward right now. As γ approaches one, the agent becomes more farsighted, taking the long run into account; as γ approaches zero, it becomes myopic and greedy, caring only about maximizing the immediate reward. How you set γ directly affects what kind of policy the agent will learn.

So we've covered the policy, which tells you how to take actions, and the return, which is the cumulative reward you collect over one episode. But we're still not done. The agent's goal, I said, is to maximize the total cumulative reward it receives over the long run, and we haven't handled the "long run" part yet. Notice that G_t is a random variable: the rewards you collect come from one particular play of the game. Under a stochastic policy each action can branch the trajectory differently, so there are many possible trajectories. The agent's goal is therefore to choose actions that maximize the expected discounted return: you take an expectation over these different episodes. In one episode you win, in another you lose; you run the game many times, each run yields a return, and the expectation is, in practice, just the empirical average of those returns.
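As a quick sketch, here is the discounted return computed for every time step of one episode; the backward recursion G_t = r_{t+1} + γ G_{t+1} is just the definition above rearranged:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """G_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ... for each t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):   # sweep backward from T
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# e.g. rewards [0, 0, 0, 1] -> returns approx. [0.970, 0.980, 0.990, 1.0]
```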
So we went from reward, to return, to discounted return, and then to expected discounted return, and that is what we want to maximize. This expected discounted return is what the literature calls a value function. The formal definition of the state value function is: the value of a state s under a stochastic policy π is the expected return when starting in state s and following π thereafter, that is, v_π(s) = E_π[ G_t | S_t = s ]. You start in state s and play the game; the value of that state is the total of the rewards you will collect from there, except that those rewards are stochastic, so you take an expectation. That is the state value function: how much return will I get from state s under policy π? There is also something called the action value function, which I will skip, since I'm running short on time. So: we have the policy; the return, discounted; and the expected discounted return, which is the value function, and which an RL agent tries to maximize. I'll also skip the next part, because I need to get to the deep learning half.

Broadly, there are three classes of approaches to the RL problem: value-based reinforcement learning, policy-based reinforcement learning, and model-based RL. Now that we know what the value function is: value-based approaches try to learn the value function; policy-based RL takes a different route and tries to estimate the policy directly, without going through the value function; and model-based approaches try to build a model of the environment. For now let's focus on policy-based RL; that's the one showing a lot of success and is currently the more active area.

What I've covered so far is essentially chapter three of Sutton's book. Chapters four through thirteen are what's called classical reinforcement learning; with the advent of deep reinforcement learning I can probably safely skip those chapters and come directly to the policy gradient, and that's what I'm going to do. But if you feel you need the full picture, please do read chapters four to thirteen, which cover the classical ways of solving reinforcement learning problems.

Now we jump directly into deep reinforcement learning, and with this motivation I think it's easy to see what's happening. Let me recall the notation once more, because we always have to be clear about what we're talking about: the goal is what you want to achieve; the reward is R_t; the return is the cumulative discounted reward over the long run; the value function is the expected discounted return; the policy tells you how to take actions; and the agent's goal is to find a policy that maximizes the expected discounted return. Deep reinforcement learning is then exactly what it sounds like. I told you there are three approaches: value-based, policy-based, and model-based RL. So why not use a deep neural network to approximate the value function, use a deep neural network to approximate the policy function, or use a deep neural network to model the environment?
Those are essentially the three variants. The earlier successes, including the Atari paper, came from approximating the value function; that approach is technically called a deep Q-network, or DQN. I'm going to skip DQNs for this entire tutorial and focus instead on PG, the policy gradient method, where we directly try to optimize the policy. There is also a related algorithm called actor-critic. If you go back to read up on this later, PG and actor-critic are the two things you really want to focus on: they are much easier to understand, and they're where a lot of the potential is showing. So we'll focus on this algorithm called policy gradients.

The first step is to take the stochastic policy we defined and parameterize it. I'm going to say the policy depends on some parameters θ. How does that work in our Pong example? Pretty simply: you take your image and pass it through a neural network; in the simplest case a plain deep network, or it could even be a CNN. So I take the 210 × 160 × 3 image, pass it through the network, and at the output I put a softmax layer (for two actions, equivalently a sigmoid) producing 1 or 0: 1 means I move the paddle up, 0 means I move it down. That gives me a parameterized policy for moving the paddle up or down. What is this modeling? Given any image S_t, the probability that the action A_t is up or down. That is the policy network, and the whole goal now is to learn the parameters of this policy network within the reinforcement learning paradigm we discussed. More precisely, the goal is to learn a parameterized policy that can select actions without consulting a value function. (Technically, a value function may still be used in learning the policy parameters, but it is not required for action selection, as it is in Q-learning and other value-based algorithms, for those familiar with them.)

Then we apply the standard deep learning trick. I have policy parameters θ, the weights of this network, and I want to maximize a performance measure J(θ). As usual we'll do stochastic gradient steps; the only difference is that since we are maximizing, it is stochastic gradient ascent: θ_{t+1} = θ_t + α ∇θ J(θ_t). You could fill twenty or thirty pages with the different ways of computing this gradient; that, really, is all the policy gradient literature is about.

So let's come back to the objective. We said we want to maximize the expected discounted return, so the performance measure will be that value function: given a start state, I play the game and collect a return, and I want to maximize its expectation. The whole question is how to find the derivative of that with respect to θ. For lack of time I'll compress this, but it is probably the most crucial point of the talk: how do we compute this derivative?
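Before the derivation, here is a minimal sketch of the parameterized policy π_θ we'll be differentiating, as a two-layer network in plain numpy; the shape (one hidden layer of ReLU units, a single sigmoid output for P(up)) follows Karpathy's blog, but the input dimension (full uncropped frame) and the initialization are assumptions:

```python
import numpy as np

D = 210 * 160 * 3        # flattened frame size (no cropping, for simplicity)
H = 200                  # hidden units; an arbitrary choice
rng = np.random.default_rng(0)
theta = {
    "W1": rng.standard_normal((H, D)) / np.sqrt(D),   # input -> hidden
    "W2": rng.standard_normal(H) / np.sqrt(H),        # hidden -> logit
}

def policy_forward(x, theta):
    """pi_theta(up | x): map a state vector to P(action = UP)."""
    h = np.maximum(0.0, theta["W1"] @ x)      # ReLU hidden layer
    logit = theta["W2"] @ h
    return 1.0 / (1.0 + np.exp(-logit))       # sigmoid output

# wiring it into the earlier episode sketch:
#   prob_up_fn = lambda x: policy_forward(x, theta)
```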
There are two standard tools people use: one is called the log-derivative trick, and the other is factoring the log-probability of an episode. The log-derivative trick tells you how to differentiate an expectation: the gradient of an expected value of a function f(x) is ∇θ E_x[f(x)] = E_x[ f(x) · ∇θ log p(x; θ) ]. You can forget the derivation; what matters is that it pushes the gradient inside the expectation, which is very helpful because an expectation is something we can estimate with an empirical average.

The second ingredient is the log-probability of an episode. Think of it this way: I run the game and get a sequence of states and actions; what is the likelihood of that trajectory? This is where the dynamics come in. The probability of a trajectory τ is the probability of the first state, times, at each step, the probability of the action given the state, times the probability of the next state given the state and action: p(τ) = p(s_0) · Π_t π_θ(a_t|s_t) · p(s_{t+1}|s_t, a_t). The middle factor is the policy, and in the last factor I'm again invoking the MDP, the Markov property, because the transition does not depend on anything further back. Take the log of this product, and it becomes a sum; take the derivative with respect to θ, and everything that does not depend on θ drops out. The derivative depends only on the policy. That is the beauty of this: the state dynamics completely disappear. With the policy gradient I don't need to know the transition probabilities or how the states evolve; all I need is the gradient of my policy network.

And that is the final takeaway of this whole talk, this one equation, if you remember nothing else: the gradient of the value function with respect to the policy parameters θ is an expectation over trajectories, ∇θ J(θ) = E_τ[ R(τ) · Σ_t ∇θ log π_θ(a_t|s_t) ]. When I say expectation over trajectories, think of it as: you play the game a thousand times, get a thousand episodes, and average over them; that's how you estimate the gradient. Look at the parts of this expression. One part, R(τ), is the total reward: I play the game once and get a reward of plus one, I play again and get minus one, and I weight by that. The other part is exactly the standard log-likelihood loss of a softmax classifier.

The interpretation is best seen by comparison with supervised learning. In a standard supervised setup, I'd have a CNN predicting up or down, and I'd be given lots of labeled examples: for this image move the paddle up, for that one move it down; that would be the standard supervised way of training the network. What's happening here is that you don't get supervision at the level of individual actions. All you get is: you play the game to the end and receive a reward of plus one or minus one, and now you have to figure out how to propagate that single reward back to every action you took along the way. The log-probability terms, weighted by the return, are exactly what performs that credit assignment over the long run.
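To make the takeaway equation concrete, here is a hedged sketch of the resulting update for the two-layer network above; the per-step return weighting and the learning rate are conventional choices, not something fixed by the talk. For a sigmoid output, d log π / d logit = y − prob_up, with y = 1 for UP and 0 for DOWN, the same expression as in logistic regression:

```python
def reinforce_update(episode, theta, gamma=0.99, lr=1e-3):
    """One gradient-ascent step of grad J = E[ G_t * grad log pi(a_t|s_t) ]."""
    states, actions, rewards = episode
    G = discounted_returns(rewards, gamma)     # credit for each time step
    for x, a, g in zip(states, actions, G):
        h = np.maximum(0.0, theta["W1"] @ x)   # forward pass, keep h for backprop
        prob_up = 1.0 / (1.0 + np.exp(-(theta["W2"] @ h)))
        y = 1.0 if a == UP else 0.0
        dlogit = (y - prob_up) * g             # return-weighted grad of log-prob
        dh = dlogit * theta["W2"]
        dh[h <= 0] = 0.0                       # ReLU gate
        theta["W2"] += lr * dlogit * h         # ascent: we are maximizing J
        theta["W1"] += lr * np.outer(dh, x)
```

In practice you would average the gradient over a batch of episodes, the empirical stand-in for the expectation, and subtract a baseline to reduce variance, which is exactly where the talk goes next.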
Let me end this part with one slide; I have two more after it. The reason the policy gradient is important is this: if you want to learn just one method in deep reinforcement learning, it should be the policy gradient, and it's the one to read up on further. There are a few reasons. It is often much simpler to approximate the policy directly; the choice of policy parameterization is a good way of injecting prior knowledge; and stronger convergence guarantees are available for policy gradient methods than for action-value methods. As for the next steps, the more advanced topics if you continue from here: the equation I showed you actually does not work that well in practice. The theory is the same, but you have to control the variance of the estimator, and for that you subtract something called a baseline, which decreases the variance. You'll find entire papers on how to control the variance of the gradient of the value function; essentially, all of that literature is about getting better estimators of that gradient. The next step beyond that is the family of actor-critic algorithms, which use both a value function and the policy gradient, and which we will not cover.

I'll end with two slides. Can reinforcement learning solve every kind of problem? That's worth going away and thinking about: what setups can RL not handle? Here is one game, called Montezuma's Revenge, where RL is known to fail quite miserably. The crucial part of the game is that the agent has to figure out that there is a key, that the key will unlock some hidden treasure, and that opening it yields a reward. When humans play, it is completely natural to us that a key means something, so we don't explore aimlessly; we go straight for the key. But for a machine, for an RL agent, to figure out that picking up the key is the right move takes an enormous number of trials. In this game, humans still do much better than RL.

I'll close with this slide, as a thought to take home. With all the hype, can AI be dangerous? With supervised learning, the idea that AI has negative consequences sounds almost ridiculous; but think about it now, with what I've told you about RL. An RL agent can be programmed to do something beneficial, but in the end all it cares about is achieving its goal. You specified some reward; you never specified how the goal should be achieved. It may well discover a very destructive way of achieving it, because all it cares about is maximizing the reward. Suppose I specify the reward as: the goal is to drive me to the airport as fast as possible. Specified that way, the agent could cause a lot of accidents before it even gets there, because it has no notion of what an accident is; all it sees is the single objective of reaching the airport faster. That is where all these concerns come from: is specifying rewards and goals like this even the right approach? How do you bring in other constraints? How do you bring in fairness, morality; can an agent have guilt? And with that I will end the talk. Thank you very much. Our next talk is...