So, some of the concepts you saw in the last lecture will be a little repetitive, and then I will introduce OpenAI Gym, which provides standard environments for testing and creating your reinforcement learning agents, and show you how to install it and how to create simple algorithms using those environments. So, here's the agenda. We will start with a quick overview, and then we will set up OpenAI Gym and its dependencies. It is Python based, so you need Python 3.5.3 onwards, plus NumPy and Matplotlib. Then we will look at some environments offered by OpenAI Gym, such as CartPole and FrozenLake, and implement simple algorithms against them. At the end, I will show you a demo of the Atari Pong game using deep Q-learning. It is an already pre-trained model, so we will just see how it works.

Okay. So, to start with: what is reinforcement learning? It is agent-based learning, where an agent interacts with its environment and gets feedback as a reward, and what the agent basically tries to do is maximize this accumulated reward over a period of time. So, how is it different from other types of learning, like supervised and unsupervised learning? In supervised learning, we have labeled data and we train to predict a target variable. In unsupervised learning, we don't have labeled data, but we try to find structure in the data. Compared to these, reinforcement learning is different in two ways. First, the agent is trying to learn the environment from the reward it is getting, and all it can do is take actions; because of that, it has to start with some trial and error to learn the environment, and the agent gets better with accumulated knowledge of the environment. Second, the agent only gets the reward after it takes an action, so it is a kind of delayed reward. So, those are the two distinguishing features of reinforcement learning compared to other types of learning.

You have seen this diagram in the last section. There is an environment and there is an agent. The agent takes an action at any point of time t on the environment, and what it gets back is the new state and the reward signal. So, depending upon this information about the environment, the reward and the new state, the agent decides what the next action will be. That is why it is called agent-based learning.

So, now what we will do is get a bit of a feel for this environment-and-agent setup. Let us assume you are a reinforcement learner. We have some environment; the actions you can take are action 1 or action 2, and there are two states possible, state A and state B. So, now you tell me what action you are going to take, and let us see what reward you get. Some of you can just shout out an action. You don't know anything about the environment right now; this is the start state, t equal to one. One or two? [Audience: One.] Okay, let's try one. You get plus 9. Now what? [One.] Plus 8. [One again.] Okay. Now what? [One again.] Okay. See, again, pause: the reward you get depends on the state you are in and the action you are taking. [Two.] Yes, exactly. So, with action one you have seen what you get; what different action have you tried? Now what? [Two.] Okay, minus 13. Okay, now? [One.] See, the state has changed. Okay. So, there is some distribution behind this.
So, you don't know that; that's the whole point. As the agent, you don't know how the rewards are going to be. Okay. [Two.] Okay. Now? [One.] So, you need to kind of remember what you did previously, right? You got a large reward; remember how you arrived at that large reward. Yeah. Now I think you've kind of got it, right? If you just keep on doing one, what you get is a reward somewhere between 5 and 10. And if you do two, you get a negative reward, but the state is now different, state B, and from state B the other action gives you a large reward. So, one particular action in state A and another in state B is kind of the optimal policy. That's what we are trying to get at, right? Because if you just combine, say, 37 minus 10, it's still better than just doing one, one, one, right?

Okay. So, when we started, we didn't know what actions to take, so we took random actions, right? That is called exploration. You had to explore your action space and see what reward you were getting. That's the environment I defined; there can be many more states in different environments. No, the total state space here is just A and B. The environment is defined so that one of the actions in A takes you to state B. So, when you explore, when you choose randomly, that's what's called exploration: you are exploring the environment. And once you know that in state B a particular action gives you a large reward, and you kind of remembered it, right, and you choose it again and again later, that is what's called exploitation of the environment, because you're exploiting your knowledge about the environment.

And there is always, yeah — no, it's nothing to do with right or wrong. When you get a negative reward, your total accumulated reward goes down, right? So, it's not about right or wrong; it's just less reward for taking that action in that state. So, that's what we figured out. The optimal policy is: in this state, take this action; in that state, take the other action. In some different setting, with a different distribution of the rewards, taking action one in A and taking action one even in state B could be the optimal policy. Now, about the rewards: they are whatever the environment defines. Right now this is a simple one, just a tabular way of defining the rewards. And why the bad action in A? Because that's how the state changes, right, A to B. If you want to take the good action in B and get the high reward, you have to be in state B, right? From A, how else will you get to state B? Yeah. So, if you keep on doing one in state A, you will end up with a reward between 5 and 10 each step, which is an average kind of reward; the cumulative reward is more under this policy. Okay. So, maybe I can answer more questions as we go; let's move ahead.

By the way, this demo was inspired by a demo given by Dr. Richard Sutton in one of his lectures. He used Lisp; I just tried to mimic it in Python.
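To make that interactive demo concrete, here is a minimal sketch of how such a two-state environment and a fixed policy could look in Python. This is my own reconstruction for illustration: the reward ranges, the transitions, and which action is best in each state are assumptions chosen to match the numbers shouted out in class, not the actual demo code.

```python
import random

# A hypothetical two-state environment mirroring the classroom demo.
# The agent cannot see this table; it must discover it by trial and error.
REWARDS = {
    ("A", 1): lambda: random.uniform(5, 10),     # modest reward, stay in A
    ("A", 2): lambda: random.uniform(-15, -10),  # penalty, but moves you to B
    ("B", 1): lambda: random.uniform(30, 40),    # large reward, back to A
    ("B", 2): lambda: random.uniform(-5, 0),     # poor choice in B
}
TRANSITIONS = {("A", 1): "A", ("A", 2): "B", ("B", 1): "A", ("B", 2): "B"}

# A policy is exactly what the lecture says: a mapping from state to action.
# This one encodes what the audience eventually discovered: pay the small
# penalty to reach B, then collect the large reward there.
policy = {"A": 2, "B": 1}

state, total = "A", 0.0
for t in range(100):
    action = policy[state]
    total += REWARDS[(state, action)]()   # sample a reward
    state = TRANSITIONS[(state, action)]  # move to the next state

print("accumulated reward:", round(total, 1))
```

Running this with `policy = {"A": 1, "B": 1}` (always action one) gives a noticeably lower total, which is the "37 minus 10 is still better" point from the demo.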
Okay. So, we'll quickly go over some of the basic concepts in reinforcement learning, and then we will jump to OpenAI Gym, its installation and implementation. A policy is the way the learning agent behaves at a given time, in a given situation. It's basically a mapping between state and action: what states there are, and what action you will take in each. A reward signal is a scalar number returned by the environment, and that's how the agent learns, how the agent knows whether whatever action it took was good or bad. As I said, the goal is to maximize the accumulated reward over a period of time. A value function is associated with an action: how good that action is in terms of the total accumulated reward it is going to give me. A model is basically a descriptor of the environment, of how the environment works; with a model, the agent can infer how the environment is going to behave for different actions in different states.

Okay. So, in the little interactive demo, there was an environment which you didn't know, in terms of rewards and actions, and there was a reinforcement learning problem: we want to maximize the total accumulated reward. Now, if you want to design or test algorithms, the environment plays an important role. Suppose you design one algorithm and test it with, say, a customized environment you built yourself, and some other person builds another algorithm; it's very difficult to compare or test those two algorithms if the environments are different. So, to solve this problem, what OpenAI did is they took different environments from the literature and created this library of standardized environments you can use for training and designing your agents. The advantage being, you can compare two agents, or two algorithms, because the environment is standardized.

So, what does OpenAI Gym do? It's a toolkit for developing and comparing reinforcement learning algorithms. Why can you compare? Because the environments are standardized, and that's why you can compare two algorithms developed by two different people on the same environment. The gym library is nothing but a collection of environments, and we will see the different types of environments in this library. So, this is the URL; we'll go to it and see the different environments. The problems it tries to solve: it gives you a better benchmark, and it standardizes the environment. And because of this, a lot of advancement becomes possible, where people develop their algorithms against these standardized environments.

So, this is — let me — is it visible at the back? Okay. So, these are the classic control environments. As you can see, this is CartPole: you have a cart and a vertical pole. This is the environment described by Dr. Sutton in his book, and they have encoded it here. The action you can take is to move the cart left or right, and the aim is that the pole should stay vertical. If at any point the pole tilts more than some angle, it is bound to fall, and the episode will end. And then there are a few environment parameters, like the cart velocity and the angle of the pole from the vertical line; those are the parameters you will get. We'll see more on a later slide. Likewise, there is MountainCar, where you have to reach the top, and the actions you have move the car in different directions. There are also some text-based environments. This one is FrozenLake, a 4-by-4 grid environment.
There is a start state, there is a goal state, and there are some states marked H, meaning there is a hole: you will fall in and the episode will end. The target is to start from the start state and reach the goal state; that's when the episode ends successfully. You get a reward of 1 when you reach the goal state, and 0 otherwise. There are also the Atari environments you saw in the last session. All of these environments you can code against and develop against; that's what you get with this gym library.

So, whoever has a laptop and wants to start, let's clone this repository. It's my repository. What I have there is template code, so you don't have to write the entire thing; that's not feasible in 90 minutes. I have developed the template code, and what you will write are just the important pieces, the equations. So, clone that, and there is a requirements.txt which specifies the dependencies you need. There are not many dependencies. Let me show you quickly. For multi-armed bandits: the multi-armed bandit environments are not included in gym, so there are separately defined environments. There is a Git repository link there; you have to clone it and just install it. The second one is also needed; both are needed.

If you have a Windows 64 machine, I have a pen drive here; you can copy from there and you don't need to install your environments. For other operating systems — okay, is it visible now? Yeah. It's github.com, slash, saurabh with the numeral one, deshpande — my first name and last name — slash, odsc-2018.git. There is one more pen drive I have if you want; there are already two, so can you please pass it back? From the pen drive, don't copy everything; it's a bit too heavy, 8 GB, and not everything is required. There are also installers for Anaconda there if you want them, but if you have Windows 64, just copy the Anaconda3 folder — that one is for Windows 64 — and the repository, okay? Yeah. And if you are done with the pen drive, can you please pass it to the front? I have three, and I have given out all of them. The internet is not available here, so you can use your phone's internet if you want to install things or use APIs. It will come to you, just wait; people are still copying.

If you want to see all the environments supported by OpenAI Gym: first you need to import envs, the module from gym, and then you can print the registry object, which lists all the environments. Otherwise, you can always refer to the Gym wiki, where there is a page like this for most of the environments. What they describe is what the environment is, how the environment gives out the reward, and what it means to solve that environment. Okay. So, in the CartPole case, if you get an average reward of 195 over 100 consecutive trials, then you can say that your algorithm has solved this environment. You can also post your scores there, and that's how you can compare your algorithm with ones that have already been developed.
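Going back to listing all registered environments: here is a minimal sketch of what that registry call looked like with the Gym API of that era (`envs.registry.all()`; newer Gym and Gymnasium releases changed this interface, so treat the exact call as version-dependent):

```python
from gym import envs

# Print the id of every environment registered with Gym,
# e.g. CartPole-v1, FrozenLake-v0, MountainCar-v0, the Atari games, ...
for spec in envs.registry.all():
    print(spec.id)
```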
So, you start an environment. What is the environment here? You have a cart and a vertical pole. When you start the environment, it gives you some state telling you where the pole is, and you move the cart either to the left or to the right. Depending upon your action, the pole will tilt one way or the other; if it tilts more than some angle, it will fall down and the episode will end. Okay. The more time steps you survive, keeping the pole vertical, the more will be your reward. You will not get there in one episode, right, in one trial. You have to do multiple trials, accumulate your rewards, learn the environment via some algorithm, and when this accumulation of rewards over the trials goes above 195 for 100 consecutive trials, you can say that your algorithm has solved CartPole.

Okay. So, has everybody installed or copied? No — this slide is for your reference. That is from the gym source code, showing what the environment looks like. There are four parameters in this environment: the cart position, the cart velocity, the pole angle, and the pole velocity at the tip. All four of these parameters affect the position of the pole. And when you move the cart, you can either push the cart to the left or to the right; these are the only two actions you can take in this environment. That is your action space. Okay. Depending upon your action the pole will tilt, and if the pole tilts more than, I think, some 12 degrees or so, it fails and that episode will end. Yeah, that is one episode. So, our goal here is to somehow figure out, from those parameters, the actions which keep the pole standing vertical.

Before tackling the CartPole problem itself, what we will do is code a very simple loop where we initialize the environment, take a random action, and render it, okay, so that you get an idea of the basic loop for an agent. Start with basic_agent_loop.py in the assignments folder. There is no code in it; you will write a basic simple loop in the main function, and what you will write follows this pseudocode and the corresponding Gym APIs. So, what we are going to do is import gym and start the CartPole environment. We will loop over some number of episodes. When you start an episode, how do you start it? You have to call the reset function: a new episode has started. Okay. Then, once the episode is started, we go into an inner loop and take the action, left or right, until the episode is finished — there is a point where the pole has tilted by some angle and the episode finishes — and then you break out. Then we do some printing: you need to save the total number of steps taken across all the episodes, and dividing by the number of episodes gives us the average number of steps per episode.

Okay. So, on the right-hand side, for your help, I have given the APIs. How do you import gym? `import gym` is the call. Then you have to create an environment; you do a `gym.make` call with the name of the environment — you can get the exact name from the environment's page. To start an episode, it's `.reset()`.
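Before writing the loop, you can sanity-check those four observation parameters and two actions for yourself. A small sketch, assuming the classic Gym API from the time of the workshop:

```python
import gym

env = gym.make("CartPole-v1")

# Four observation parameters: cart position, cart velocity,
# pole angle, pole velocity at the tip.
print(env.observation_space)       # Box(4,)
print(env.observation_space.low)   # lower bounds of the four numbers
print(env.observation_space.high)  # upper bounds

# Two discrete actions: 0 = push cart left, 1 = push cart right.
print(env.action_space)            # Discrete(2)

env.close()
```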
So, you have to create an instance of that environment, CartPole-v0 or whatever it is; we will use the CartPole environment, and I will tell you the name. You pass a string like "CartPole-v1"; that is the name, and v1 represents the version. If they change something in the environment, they version it using this v0, v1 scheme. Right now, the purpose is just to get familiar with the agent loop, so we don't want anything clever. `env.action_space` is what gives you all the actions — in this case there are two — and `.sample()` gives you a random action out of them, so we will just use that random action. Then, how do you take the action on the environment? There is a `step` function; you just pass that action to it. If you want to render how the cart pole is behaving, the actual visualization, you have to call `.render()`. And when you are done with the whole loop, just make a `.close()` call on the environment so that it shuts down.

So, those are the numbers I showed: these are the parameters — the position of the cart, the cart velocity, the pole angle, and the pole velocity at the tip. These are the things which describe this environment, and they decide the position of the pole. I will show you what `step` returns; the observation is part of that return, not the only thing it returns. It returns four things: the new state, which is those four parameters; the reward; `done`, which will be true if the episode is done; and `info`, which is just additional information for debugging or other purposes. So, that is what the environment gives you. This is what you get when you take an action with `.step()`, and you can use it further. Later we will change it to take actions in a more intelligent or different way: in the next part, we will change it using a random search or hill-climbing kind of algorithm. For now, just choose the random action; the purpose is just to get familiar with the structure. And yes, `.action_space` is all the actions, two in this case, and `.sample()` is the random one.
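Putting those pieces together, here is a minimal sketch of what the basic agent loop might look like. This is my reconstruction from the walkthrough, not the repository's exact template, and it assumes the old Gym API where `reset()` returns the observation and `step()` returns a 4-tuple:

```python
import gym

env = gym.make("CartPole-v1")
total_steps = []
n_episodes = 50

for episode in range(n_episodes):
    env.reset()                    # start a new episode
    steps = 0
    done = False
    while not done:
        env.render()               # visualize the cart and pole
        action = env.action_space.sample()          # random action for now
        new_state, reward, done, info = env.step(action)
        steps += 1
    total_steps.append(steps)      # episode over: the pole fell

print("average steps per episode:", sum(total_steps) / len(total_steps))
env.close()
```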
So, here is one way it goes. First you import gym. `make` is the call where you initialize the environment. We will record the total number of steps, and we will run it for, say, 50 episodes, looping over the number of episodes. Now, the first thing we need to do is reset each episode, so we make the reset call — is the font visible at the back? — and then we loop until the episode is done. How do we know the episode is done? See this: we get the action, which is a random action for now, then we take that action on the environment, which returns the new state. Do you remember that diagram? The agent takes an action on the environment, and what it returns is the new state and the reward. So, here it is: new state, reward, done — which will be true if the pole has fallen — and info, which is just debugging or other information. The threshold is defined by the environment, and the environment is provided by gym; that's where the standardization comes in, because if I coded this environment myself, I would pick some number for where the pole falls, and you would pick some different number, and then we could not compare our results, right? No — there is a way to create a custom environment and register it, but you cannot change the standard one. Of course it is open source, you can go and change it, but that's not the point.

Okay, so `render` will actually draw the visualization of the cart, and when `done` is true we just append the number of steps taken, break out, and the last call is `close`. Let me run it so you can see how it works. So, here it is showing: when the pole falls, it starts over, and you can see the pole swinging around and the actions being taken. Yeah, you have to interpret it a bit, but this is the basic structure within which you can work, and you can implement different strategies for selecting actions, maximizing the reward, those kinds of things. In the assignment you have to code it, and a partly coded example is also in the code folder, so if you don't want to code it or just want to run it, you can take it from there; the file name is basic_agent_loop. Okay, so in the interest of time, let's go to the next one.

For the next one you don't have to code everything; there is template code available. Now, in the basic agent loop we selected the action randomly, which is not very good, not very efficient. So, now we will apply two simple strategies to select an action so that we get a better cumulative reward over our episodes. The first one is the random search approach. Instead of randomly selecting the action, we take the parameters we are getting back from the environment and apply some random weights. Somehow, you need to convert the parameters you are getting back from the environment into an action, right? So, the first simple strategy is: we will have four random numbers, one mapping to each of the four observations returned by the environment, and we will take a dot product. If that dot product is, say, greater than zero, we go right; if it is less than zero, we go left. The weights we use for converting these observations into an action are chosen randomly — that's why it is called random search. And what this diagram depicts is: you keep drawing random weights, and there will be a weight vector which gives us the maximum reward, and that is the one we choose. That's what we will do in this random search.

Yes — over a period of time, yes, using the next state each step. So, to recap: in the CartPole environment, when you take an action it gives you back four numbers, right? Now, I will randomly choose four numbers as weights, and I will take a dot product, which gives you a single number, right? If that number is less than zero, I move the cart to the left; if that number is greater than zero, I move the cart to the right. The weights change from one episode to the next episode. So, I will quickly show you the code so you understand. It is the same kind of structure: you create the environment, there are episodes, and there are a few more things we try to record, like the best reward and the best parameters; then we run it for a number of episodes. Now, in this run_episode function — this function runs one episode — we first reset the environment, and when I call run_episode I am passing in the parameters. And how am I selecting the parameters? I am selecting them randomly. Okay, so for that episode those will be my
parameters, and those are the weights, basically. Within the episode they stay fixed, and each time I take a dot product of the observations I am getting from the environment with the parameters, and I select the action accordingly. Okay. Now, what you have to do here is implement this random search, which is nothing but returning a random array of four numbers between minus one and one. Okay. And we will improve it a little bit in the next exercise, hill climbing, where you add some scaled noise so that you refine that randomness bit by bit. Which range? It should be minus one to one, because the observations coming back are in that range. Yes — if you are aware of NumPy, there is a random package in it which we can use.

Yes, but note the difference: a random action is completely random, you just select it. Here, at least, only the weights are random, and we are mapping them onto the parameters you are actually getting back from the environment, and selecting an action based on that. Ideally, the observations you get from the environment have to be mapped to an action somehow, right? So, let's wait a minute or two if you want to try it out. Yes — there are other ways too; you can also draw from a Gaussian distribution — that is, I think, normal from numpy.random — yeah, you can try it out. That is actually the next part; it is not very specific to this, but depending upon the environment you can try out which works best. Okay, so this is how you can generate the random weights.

So, hill climbing: it's a little improvement over randomly choosing the parameters. You add a bit of noise, scaled by a noise multiplier, and you return the weights with that; that is how you choose the parameters. So, instead of just randomly choosing fresh parameters each time, you apply this noise factor, and if the reward you are getting is better, you keep those parameters and start from them. Right now, in random search, we are just discarding the old parameters, right? But here, we are preserving the parameters which are giving us better results. Yeah — the only difference is that you use this noise-scaling parameter along with the random numbers to produce the weights, and the rest is the same: we take the dot product and choose the action. No — the noise is just a way to reduce the randomness of the weights we are choosing. Instead of completely random parameters, we use this noise to move bit by bit toward the optimum.

Okay, so I will give you a couple more minutes if you want to try it out; you can also check the code section if you want to get an idea. For gradient descent — yeah, maybe we can take that offline, because it is a different approach. No — I mean, gym just describes the environment and gives out the reward depending on the action; how it decides what reward to give depends on the particular environment. As for tweaking gym beyond the limitations of an environment, you can register your own custom environments in gym and use those. And to use Adam or any other optimizer, you would have to set up gradient descent and those sorts of structures — basically calculate a loss and do the update based on the loss, right? Can you elaborate your question, maybe? So, the reward is what is returned by the environment, and we don't fully know how the rewards are going to come for each action;
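Here is a hedged sketch of both strategies together — random search and hill climbing on CartPole. This is my own minimal version under the old Gym API (`reset()` returns the observation, `step()` returns a 4-tuple), not the repository's template, and the episode counts and `noise_scaling` value are assumptions:

```python
import gym
import numpy as np

def run_episode(env, parameters):
    """Run one episode with fixed weights; return the total reward."""
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        # Dot product of the 4 observations with the 4 weights:
        # negative -> push left (0), positive -> push right (1).
        action = 0 if np.dot(parameters, observation) < 0 else 1
        observation, reward, done, _ = env.step(action)
        total_reward += reward
    return total_reward

env = gym.make("CartPole-v1")

# Random search: draw fresh weights in [-1, 1] every episode,
# keep whichever weights scored best; old weights are discarded.
best_reward, best_params = -np.inf, None
for _ in range(200):
    params = np.random.uniform(-1, 1, 4)
    reward = run_episode(env, params)
    if reward > best_reward:
        best_reward, best_params = reward, params
print("random search, best reward:", best_reward)

# Hill climbing: keep the best weights so far and perturb them
# with scaled-down noise instead of starting from scratch.
noise_scaling = 0.1
params = np.random.uniform(-1, 1, 4)
best_reward = run_episode(env, params)
for _ in range(200):
    candidate = params + noise_scaling * np.random.uniform(-1, 1, 4)
    reward = run_episode(env, candidate)
    if reward > best_reward:
        best_reward, params = reward, candidate
print("hill climbing, best reward:", best_reward)

env.close()
```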
that's where you need to explore the actions and learn the distribution of the reward — basically, that is the action-value distribution. So, when you keep doing this over a period of time — suppose you keep taking the actions which give you better rewards, in the near term or the long term — it converges to the policy which is optimal over the period of time, with multiple actions, multiple episodes. Which algorithm? See, reinforcement learning is a structure, a way the problem is defined; there is no particular algorithm called reinforcement learning. It's agent-based learning, where you create these kinds of strategies to maximize the rewards. Maybe we can take that offline if you want.

So, the next learning problem is multi-armed bandits, which we also saw in the previous section. They are basically called multi-armed bandits because of the slot machine: you pull the arm and you get a reward, except instead of one arm you have multiple arms, and each arm can give you different rewards, with a different distribution of rewards which you don't know. The objective is: given this kind of multi-armed bandit environment, we want to maximize our reward over a period of time. To put it more formally, in a mathematical equation: at any time t, if you take an action a, the expected reward is q*(a); and as you run your episodes, what you get is Q_t(a), the estimated value of that action at a given point of time. We would like the Q at time t for action a to be as close as possible to q*(a).

So, now, when we deal with this multi-armed bandit, we have multiple actions to choose from; each arm is one action. You can choose the first arm and get some reward, choose the second arm and get some reward, and, as we saw, you can just keep on pulling the arm which is giving you some reward — that will be called exploitation, because we know what reward it gives. And there will always be some arms which we have not tried, which might also give us better rewards; that is the exploration part of it. So, we have to balance between this exploitation and exploration; that is the dilemma we have.

And the rewards we are going to get over the period of time, we have to formalize in some way, to have a record of them. One way is to keep the average reward we have got over the period of time for a given arm. How can you calculate it? It's the sum of rewards you have got for that arm divided by the number of times that arm has been pulled, that action has been taken. That is the average reward per arm. And now, when there is a time to select an action, the simplest thing you can do is: whichever arm has given you the maximum average reward to date, you select that arm. That will be the greedy part of it. So, you know that, out of 10 arms, say the third one has given me the maximum average reward so far; you can keep on picking the third one, but once in a while you should also try to explore what the other arms are giving. That way we can maximize the accumulated reward over time. Each arm is an action: you select it and you get a reward. The agent is what selects the arm and tries to maximize the reward. The agent will have the information, it will calculate the average reward per arm and keep on selecting the best if it wants to be greedy, or, if it wants to explore the other arms, it will go and select one of those instead.
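Written out in the standard Sutton and Barto notation the lecture is following, the two quantities above are (a reference sketch, not something shown on the slides):

```latex
q_*(a) = \mathbb{E}\left[ R_t \mid A_t = a \right]
\qquad\text{and}\qquad
Q_t(a) = \frac{\text{sum of rewards when } a \text{ was taken prior to } t}
              {\text{number of times } a \text{ was taken prior to } t}
```

and the goal is that the estimate \(Q_t(a)\) gets as close as possible to the true value \(q_*(a)\).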
Yes — so how are we calculating the average? Suppose you have five arms, okay? Initially you need to start randomly, and you will learn that if you pull the first arm it gives such-and-such reward, then the second, the third, and you store them like this: for each arm, the mean reward you have got, which is the sum of the rewards over the period of time divided by the number of times that arm was selected, right? And when next time you are going to choose the action, what is the simplest way of doing it? Take the max of it: whichever arm has given you the maximum mean reward, select it. That's the greedy part of it; that is from our previous knowledge. But if you keep on doing only this, you will keep getting the same maximum reward, while it is also possible that other arms give more rewards; so you have to choose them at some point of time, and you have to balance between these two — because you don't know what the other arms would give you, right?

So, for these bandits, this is all there is to the state: you choose the arm, and you see what reward you got. Each arm returns rewards according to some probability distribution which you don't know; that's why you need to choose the action carefully. You can go for the same arm always — but how do you balance between exploiting your knowledge, exploitation, and exploration? What you can do is have a parameter, say, depending on which you sometimes choose exploration — so sometimes you choose the arms which you haven't explored — and most of the time you choose from what you have, the maximum rewards you know. That way, what will happen is you will also explore your environment, and you can also take advantage of your earlier knowledge, the average reward we have stored per arm. Okay.

So, this kind of action-selection strategy is called epsilon-greedy action selection. Epsilon is a parameter you can set; say, for example, I set it to 0.5. Then what I will do is randomly generate a number from zero to one and compare it with 0.5. What will happen because of that is: half of the time I will exploit my knowledge of the learned arm rewards, and the rest of the time I will go and choose the arms which I haven't explored. So, this parameter controls the exploration versus exploitation we want to do. You can change the epsilon and see what rewards you get — yep, the epsilon is set in the code, and you can try out different epsilon values and see what total rewards you are getting.

Okay. So, this is the template for the multi-armed bandit. What we are doing here: we are calculating the mean reward per bandit, per arm, and we are remembering that. And when we select an action, what we do is: we set this epsilon for our experiment, and we generate a random number; if that random number is less than the epsilon, we select a random action, and otherwise we select the action with the best average reward per arm explored so far. Okay. So, that way, this epsilon parameter controls the exploration versus exploitation.
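A minimal epsilon-greedy bandit loop might look like this. It is my own sketch, not the repository's template; the five Gaussian arms and the step count are assumptions made so it runs standalone:

```python
import numpy as np

n_arms = 5
epsilon = 0.5          # fraction of steps spent exploring; try other values

# True mean reward of each arm -- hidden from the agent.
true_means = np.random.normal(0, 1, n_arms)

counts = np.zeros(n_arms)        # N(a): times each arm was pulled
mean_rewards = np.zeros(n_arms)  # Q(a): running average reward per arm

total = 0.0
for t in range(1000):
    if np.random.rand() < epsilon:
        action = np.random.randint(n_arms)     # explore: random arm
    else:
        action = int(np.argmax(mean_rewards))  # exploit: best arm so far
    reward = np.random.normal(true_means[action], 1)
    counts[action] += 1
    # Incremental update of the sample-average for this arm.
    mean_rewards[action] += (reward - mean_rewards[action]) / counts[action]
    total += reward

print("estimated values:", mean_rewards.round(2))
print("true values:     ", true_means.round(2))
print("total reward:", round(total, 1))
```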
Similarly, there is another way of choosing this action, which is called the upper confidence bound. It is a different equation. Here, what we are doing is adding one more parameter, which kind of controls the exploration term. The square-root term has, on top, the logarithm of the time step — so if it is the 15th time step, that will be the logarithm of 15 — and the denominator is the number of times that action has been taken. What happens because of this is: say an action has not been taken for a while — the numerator keeps increasing, because we are at the next time step, but the denominator remains the same, because we have already explored that action only so many times. And because we are choosing our action based on this kind of calculation, it sets an upper bound with which we do the exploration and exploitation.

So, we can compare this with epsilon-greedy, and let's see what happens. What I will do is run the code, which has implemented both of these action selections, and we will plot them on one graph. Before that, I will quickly show you how you can implement the equation given in the slides. This is the implementation of that equation: we have already recorded the mean rewards per bandit, this is the parameter we have set, and then you return the arm giving you the maximum value by this equation. In a loop you keep updating this state-and-action table continuously with the equation, and choose the action depending on that. So, let's try to run it: I have taken 1000 episodes, and for each episode we iterate around 850 times to get the average rewards.

You can see the result: the blue curve is using the upper confidence bound, and the red one is using the epsilon-greedy strategy. The plot shows the average rewards we have got over the period of time for these episodes. You can see that initially epsilon-greedy is faring better, but the UCB kind of strategy takes over after some episodes and then gives you better rewards over the long run. So, again: the numerator is the time step, the denominator is the number of times that action has been selected, and the parameter c controls the degree of, again, exploration versus exploitation — but this time it puts more emphasis on which actions you have previously selected. In the earlier implementation we were just randomly selecting an action, right? If the random number is less than the epsilon we set, you randomly select from the environment, or else we select from what we have in our memory, the mean rewards per action. But here the improvement is: instead of randomly selecting, we are trying to see which actions have a higher probability, a higher possibility, of giving you better rewards. In the epsilon-greedy case, when it is time to choose an action randomly, we just choose it randomly, without even bothering which action has more potential of giving us better rewards. Yeah — and if you go into the literature and see the derivation of the equation, there are reasons, because of that term, why it fares better than epsilon-greedy in the long run.

Can it get stuck? Yeah — I mean, that variability comes from the factor c in the formula; you can change it and see. But at some point of time, because of the bonus term in the formula, you will end up exploring the other actions; it's not like it will always give you the same action. In epsilon-greedy, what you did is randomly select the actions; here, what we are trying to do is at least select an action which might give us the reward in the long term. So, even then, in some cases the second action will get selected, and it will become the greedy action. Yes — and it depends on the domain and the environment; before applying this, this discussion is just in the context of what the gym environments give.
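For reference, the UCB selection rule the slide describes is usually written as \(A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\ln t / N_t(a)} \right]\). Here is a hedged sketch of a bandit loop using it — again my own minimal version with assumed Gaussian arms, not the repository's template:

```python
import numpy as np

n_arms, c = 5, 2.0                 # c controls the exploration bonus
true_means = np.random.normal(0, 1, n_arms)   # hidden from the agent
counts = np.zeros(n_arms)          # N(a): times each arm was pulled
mean_rewards = np.zeros(n_arms)    # Q(a): running average per arm

total = 0.0
for t in range(1, 1001):
    # Upper confidence bound per arm: ln(t) grows every step, while
    # N(a) grows only for the arm actually pulled, so neglected arms
    # accumulate a larger bonus. An arm never tried gets an infinite
    # bonus, so every arm is pulled at least once.
    with np.errstate(divide="ignore", invalid="ignore"):
        bonus = c * np.sqrt(np.log(t) / counts)
    ucb = np.where(counts == 0, np.inf, mean_rewards + bonus)
    action = int(np.argmax(ucb))

    reward = np.random.normal(true_means[action], 1)
    counts[action] += 1
    mean_rewards[action] += (reward - mean_rewards[action]) / counts[action]
    total += reward

print("pull counts:", counts.astype(int))
print("total reward:", round(total, 1))
```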
Okay, fine. So, just to close off now, we will see the demo of deep Q-learning. There is an environment for the Atari Pong game. First, let's run Pong without any intelligence, selecting actions at random, so we get an idea of what the environment is. This is the game: the right-hand side paddle is our agent, and the left-hand side is the environment. Whenever the ball is missed by one side, the other side gets the reward. Right now our agent is not doing much, so our reward is around zero. Yes, this was the random agent. Now I will run the code trained using a deep Q-learning network, and let's see how it fares. This model is already pre-trained, and I am just using that model for running the game.

So, thank you very much for attending the workshop. If you have any questions, or if you need any help with the code, please contact me; I am on LinkedIn and you can get connected with me there. And you have the repository with you, so you can go clone it and play with it. Thank you.