Aliens are real, and it's your job to discover what kind of life they lead. Would you rather integrate yourself into their society to answer this question, or would you rather observe them from a distant planet? Trust me, this is related to on-policy versus off-policy strategies, and hopefully by the end of the video you'll see why. We're going to divide this video into three passes, where we talk about what a policy is, what on-policy and off-policy mean, and then get into more details as we go further into the video. Pay attention, because I'm going to quiz you. Let's get started.

So reinforcement learning is one of the three major ways in which machines can learn. Specifically, reinforcement learning is learning what to do, that is, how to map situations to actions, so as to maximize a numerical reward. Now, a function that maps a situation, which is a state, to an action is known as a policy, and it's these policies that are used to maximize a numerical reward. So reinforcement learning is essentially learning a policy to maximize a numerical reward. In order to learn this policy, reinforcement learning algorithms are used, and Q-learning is one such popular off-policy reinforcement learning algorithm.

So what makes Q-learning off-policy? Off-policy reinforcement learning algorithms make a crucial distinction between two types of policies. The first is the policy that's used to take actions and actually navigate the environment; the policy followed here is known as the behavior policy. The second is the policy that's used to actually learn, and in the Q-learning case, it's the policy used to update the Q-values in a Q-table, for example; this is the target policy. Now, for off-policy reinforcement learning algorithms, the behavior policy can be different from the target policy, whereas for on-policy algorithms, these two policies need to be the same. And this is the critical difference between on-policy and off-policy algorithms. That's going to end pass one.

For more details now, let's dive into Q-learning specifically and also discuss why Q-learning is an off-policy algorithm in more detail. Let's take a fully observable grid world of nine squares. You can see here that there's a plus 10 square, a minus 10 square, and everything else is negative one. The goal for the agent is to start in the top-left corner and get to the plus 10 reward, making the best possible sequence of actions. So we want the agent to learn an optimal policy, and we say that we want to use Q-learning to learn this optimal policy. To use Q-learning, we need a Q-table that will help us dictate actions. This Q-table will have a set of rows that are states and a set of columns that are actions, and each of these values is randomly initialized. Now, to begin the learning process, the agent actually has to navigate the environment using a behavior policy, and let's say that this behavior policy is random exploration. That means that whatever state the agent is in, it will just generate some random action. So let's say that the agent is in state S1. From here, it can move either right or down. Because the behavior policy is random, let's say that the agent randomly chose to go right. We take the action, and we receive the negative one reward. Now we are in a new state; let's call it S2. Now let's calculate the observed Q-value. This is given by the Bellman equation.
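Before working through that equation, here's a minimal sketch in Python of the setup so far, assuming a dictionary-based Q-table over the nine squares; the state names, the action set, and the helper function are illustrative assumptions on my part, not code from the video.

```python
import random

# Minimal sketch of the setup: nine states, four actions, randomly
# initialized Q-values, and a purely random behavior policy.
STATES = [f"S{i}" for i in range(1, 10)]     # the nine grid squares
ACTIONS = ["up", "down", "left", "right"]

# Q-table: rows are states, columns are actions, values random to start.
Q = {s: {a: random.uniform(-1, 1) for a in ACTIONS} for s in STATES}

def behavior_policy_random(state):
    """Random-exploration behavior policy: ignore the Q-values and pick
    any action uniformly at random."""
    return random.choice(ACTIONS)

# The agent starts in S1 and, in our example, happens to pick "right".
action = behavior_policy_random("S1")
```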
And the Bellman equation, by the way, defines a recursive relationship between Q-values; for more information, I'd recommend checking out my video on that topic. So the total reward observed is the reward that we receive in transitioning into state S2, plus the maximum possible reward that we can receive from state S2. Gamma here is a discount factor that describes how much importance we want to give to future rewards as opposed to immediate rewards; it's a number between 0 and 1. Now, in order to determine this second term, we need to look at the maximum Q-value from state S2. So we go to our table, and we see that the largest value is for taking S2 down; that's 1.5. So we can use this in our equation: negative 1 plus, let's say, a discount factor of 0.1, times 1.5. Doing the math, you get negative 0.85. But the value currently stored in the Q-table for S1 right is 1, and the difference between these two is known as the temporal difference error. I have a full video on this topic for more details. But essentially, plugging the values in, you take negative 0.85 as the observed value, minus the expected value, which is 1, and you get negative 1.85 as the error. Now we update this value in the Q-table based on a formula that looks like a gradient update rule, where alpha here is a step size or learning rate, and we'll keep it at, again, 0.1. The new Q-value is going to be 1 plus alpha, 0.1, times the temporal difference error, which is negative 1.85. Doing the math, we get 0.815 as the new Q-value. So we update that single Q-value in our table, and this is the end of our first time step. Now we can continue with other time steps like this until the agent hits plus 10 or minus 10, updating the Q-values accordingly at every time step, and that will be the end of one episode. You keep executing multiple episodes like this until the Q-values in the table become more stable. For a complete breakdown of Q-learning, check out the full video.

But for now, let's actually backtrack to the beginning, towards that initial Bellman equation. In state S1, we took the action right according to the behavior policy, which was random. But we updated the Q-value of state S1 based on the target policy, which was greedy. Looking at this Bellman equation, the next action chosen is simply the one that gives us the largest Q-value, and that was the one that involved choosing the downward action. Note that we did not take the action going down, because down was chosen by the target policy, not the behavior policy that is used to traverse the environment. So overall, typically for off-policy algorithms, the target policy is greedy: we're trying to find the action that maximizes the current estimate of the Q-values, for example. The behavior policy, on the other hand, can be random, epsilon-greedy, or even greedy itself. What's important is that these two policies can be decoupled from each other. So, for example, a behavior policy can explore the environment and collect data, and the target policy, which is used to update the Q-values, can operate separately from the behavior policy that navigates the environment. And because they can be effectively decoupled, this is an off-policy reinforcement learning algorithm.
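Before we get to the quiz, here's a short sketch of the arithmetic from that first time step, with the example's numbers plugged in; the variable names are my own, and only the values stated above are used.

```python
gamma = 0.1   # discount factor
alpha = 0.1   # step size / learning rate

reward = -1.0        # reward received for moving right out of S1
q_s1_right = 1.0     # current Q-table entry for (S1, right)
max_q_s2 = 1.5       # best Q-value available from S2, i.e. Q(S2, down)

# Bellman target: r + gamma * max_a Q(S2, a)
target = reward + gamma * max_q_s2            # -1 + 0.1 * 1.5 = -0.85

# Temporal difference error: observed target minus current estimate
td_error = target - q_s1_right                # -0.85 - 1 = -1.85

# Gradient-style update of the single Q-table entry
q_s1_right = q_s1_right + alpha * td_error    # 1 + 0.1 * (-1.85) = 0.815
```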
Quiz time! Have you been paying attention? Let's quiz you to find out. In the scenario described, what if the target policy were set to random exploration instead of greedy? Would the agent not learn at all? Would the agent learn quickly? Or would the agent learn slowly? Let me know in the comments down below, and bonus points for any reasoning. That completes this segment of quiz time, but I'll be back, so keep listening.

This is the end of pass two, and I hope you now have a basic understanding of Q-learning as well as what makes Q-learning an off-policy algorithm. Now let's move on to pass three, where we're actually going to introduce on-policy into the mix and compare the two algorithms side by side. The algorithm here is the same off-policy strategy that we just discussed, so let's walk through it really quickly. First of all, the algorithm parameters: the step size, which is used in the gradient-style update, is a number between zero and one. We initialize all the Q-values in the Q-table, and for the first episode, let's say that we are in state S1, and for the first step we choose an action A. Now, this action is actually going to be taken in the next step, meaning the agent uses it to navigate the environment, so it should be chosen according to a behavior policy. The behavior policy that we chose in our example is a random policy, so we just choose a random action, whatever the state S is; another example we could have chosen is epsilon-greedy. We now take this action, observe a reward, and transition into a new state; this could have been S2, for example. And this equation over here is just the combination of all three equations that we just talked about: the Bellman equation from this part, then the temporal difference error here, and then the gradient-style update rule that we discussed too. Now you can see that there's another action that needs to be chosen over here, but this action is chosen only to update the Q-table value. That means this action needs to be chosen based on a target policy, and the action chosen has to be the one with the maximum Q-value from the given state, which means it has to be greedy. In our case, it was S2 and down. Note that this action was just chosen; it was never actually taken. And this is what allows us to decouple the target policy, which is greedy, from the behavior policy, which could have been anything, like we mentioned before. Now we transition to the next state, and then we repeat the process again and again until the end. So that was the off-policy setting.

Now let's try to see, from a bird's-eye view, how it's different from this on-policy algorithm. The on-policy algorithm we're going to talk about is known as SARSA. It starts out very similarly, where we define a step size and an epsilon term. We start by choosing an action A according to a policy, but note that this action A is immediately taken. If we take the action, that means this action A has to be chosen based on a behavior policy, whatever that behavior policy may be. Having taken the action A, we transition into another state S' and receive a reward R. Now we're in this new state, and we choose another action A' from this state, based on another policy. And this policy, you can see, is being used right over here: it's being used to update the Q-value itself. That means that this action needs to come from a target policy.
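To make the comparison concrete, here's a rough sketch of the two inner loops side by side; the `env.step` interface and the epsilon-greedy helper are hypothetical stand-ins I'm assuming for illustration, not the exact pseudocode shown in the video.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon explore randomly, otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[state][a])

def q_learning_episode(env, Q, actions, alpha, gamma, epsilon, start):
    """Off-policy: the action taken and the action used in the update can differ."""
    s, done = start, False
    while not done:
        a = epsilon_greedy(Q, s, actions, epsilon)        # behavior policy: actually taken
        r, s_next, done = env.step(s, a)
        # Target policy is greedy: max over next actions, chosen but never taken.
        best_next = 0.0 if done else max(Q[s_next][a2] for a2 in actions)
        Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
        s = s_next

def sarsa_episode(env, Q, actions, alpha, gamma, epsilon, start):
    """On-policy: the action used in the update is the one taken next."""
    s, done = start, False
    a = epsilon_greedy(Q, s, actions, epsilon)            # chosen and taken
    while not done:
        r, s_next, done = env.step(s, a)
        a_next = epsilon_greedy(Q, s_next, actions, epsilon)  # same policy picks A'
        target_q = 0.0 if done else Q[s_next][a_next]
        Q[s][a] += alpha * (r + gamma * target_q - Q[s][a])
        s, a = s_next, a_next                             # A <- A': A' is actually taken
```

The only structural difference is the term inside the update: Q-learning uses the maximum Q-value over next actions, while SARSA uses the Q-value of the action its own policy actually chooses and then takes.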
Now, what's really important is this next line, where we assign A' to A and then take the action A. That means that whatever we chose in the previous iteration of the loop, we're actually going to take that same action. So even though A' was chosen as part of the target policy to update the Q-values, we end up taking that action, which means it is also part of the behavior policy. And this is why the behavior policy and the target policy must match each other: the policy that we're using to explore the environment is the same policy that we're using to learn these Q-values. And hence SARSA, the algorithm that we just discussed, is an on-policy algorithm.

Quiz time! Have you been paying attention? Let's quiz you to find out. How does SARSA update its Q-values? Does it update its Q-values based on the maximum expected future rewards? Does it update Q-values based on the current policy and the actual actions taken? Does it update Q-values by choosing the action with the highest Q-value for a given state? Or does it update Q-values by always selecting the optimal action? What do you think? Leave your answer down in the comments below and we'll have a discussion. That's going to end quiz time for now, but let's summarize the key points in this video before we leave.

First, a policy is a function that maps a state to an action. Q-learning is a reinforcement learning algorithm that learns a policy that maximizes total reward. The two policies that are the main characters here are the behavior policy, which is the policy used to take actions in the environment, and the target policy, which is the policy used for optimal decision making; in this case, it's used to update the Q-values. Off-policy reinforcement learning algorithms can have a behavior policy that differs from the target policy, and thus they can decouple data collection from training. On-policy algorithms like SARSA use the same policy as both the behavior policy and the target policy, so the agent takes actions and learns using the same policy.

And that's all I have for you today. I hope all of this made sense. You now know the difference between on-policy and off-policy, which algorithms are on-policy and off-policy, and what makes them so. Thank you all so much for watching. If you liked the video and think I deserve it, please give this video a like. We're at 100,000 subscribers and would love to get to 150,000 subscribers with your help. So please hit that subscribe button and ring that bell for notifications. Thank you once again, and I will see you in another one. Bye-bye.