Hello everyone, and welcome to another episode of Code Emporium, where we're going to talk about Q-learning, arguably one of the most popular concepts in reinforcement learning. So let's get to it.
Throughout this series on reinforcement learning, which you can find in the playlist in the description below, we have discussed that machines can learn in primarily three different ways, known as the machine learning paradigms. The first is supervised learning: we have a dumb model, we have data, we have labels, and we use them to train said dumb model to recognize patterns between the data and the labels. This is primarily used in classification and regression problems. The second is unsupervised learning, where we have data but no labels, and the primary objective is to understand patterns within the data. This is used in clustering and dimensionality reduction, among others. And the third pillar is reinforcement learning. From the textbook: reinforcement learning is learning what to do, that is, how to map situations to actions, so as to maximize a numerical reward signal.
Now, we can break reinforcement learning algorithms down into two types: value-based methods and policy-based methods. Remember, at the end of the day, we're trying to maximize total reward. Value-based methods determine a value function that quantifies this total reward, and then use this value function to determine the optimal policy. Policy-based methods determine an optimal policy directly. An optimal policy is the policy that maximizes the total reward, and just to note, a policy is how an agent behaves in a given situation or state.
Q-learning is a value-based reinforcement learning method, so let's dive a little more into these value-based methods. Value-based methods determine a value function that in turn determines a policy that maximizes total reward.
A value function is a function, so it takes some inputs and generates an output. Depending on the inputs, there are two types of value functions: state value functions, represented by V, and state-action value functions, represented by Q. But states and actions, what are those? A state is a snapshot of the environment, and an action is a decision taken by an agent in that environment. So the state value function takes a state as input and outputs a real number, whereas the state-action value function takes a state and an action as input and outputs a real number. This real number is also known as a Q-value. Ooh, Q. So we're getting close to Q-learning here. The state value, that is V(s), quantifies how good it is to be in a given state s, whereas the state-action value, or Q-value, Q(s, a), quantifies how good it is to be in state s and take action a in that state.
For Q-learning, we are interested in learning the state-action value function. Because it's a function, you can think of it more easily as a table, with rows being states and columns being possible actions. Each cell holds the Q-value for that state and action. The goal of Q-learning is to learn these Q-values such that the total reward is maximized.
So let's see how it works with this grid world. This is a fully observable environment with nine squares: there's a plus 10 reward square, a minus 10 poison square, and a reward of negative one for any other square. The goal of our agent is to get to the plus 10 reward spot in the best possible way. More technically, we want the agent to learn an optimal policy. This is known as the target policy. Let's say we want to use Q-learning to learn this optimal policy, so we need to make use of this Q-table. Let's initialize its values to arbitrary values.
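To make the table idea concrete, here's a minimal sketch in Python. The state names S1 to S9, the four action names, and the 0-to-2 range for the starting values are all assumptions for illustration. For simplicity, every state gets every action, even moves that would leave the grid.

```python
import random

# Hypothetical labels: nine states and four moves. The exact grid layout
# is not specified in the walkthrough, so these names are illustrative only.
STATES = [f"S{i}" for i in range(1, 10)]
ACTIONS = ["up", "down", "left", "right"]


def init_q_table(states, actions, seed=0):
    """Build a table (rows = states, columns = actions) of arbitrary Q-values."""
    rng = random.Random(seed)
    # The 0.0 to 2.0 range is an arbitrary choice; any starting values would do.
    return {s: {a: rng.uniform(0.0, 2.0) for a in actions} for s in states}


q_table = init_q_table(STATES, ACTIONS)

# One cell of the table: the Q-value for being in S1 and going right.
print(q_table["S1"]["right"])
```

A dict of dicts is used here so that a cell reads as `q_table[state][action]`; a 2D array indexed by integers would work just as well.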
Note that these values could also be loaded from some other agent that explored the environment previously, but for now let's just keep them arbitrary. Let's say the agent starts in the first cell, at state S1, and takes an action based on an exploration policy. This means that the action the agent takes is based simply on random chance; the agent won't choose an action just because it has a high Q-value, for example. This policy is known as the behavior policy, and it can be just about anything. If the agent is in state S1, it can take an action of either going right or going down. Because the policy is random, let's say the agent decides to go right. So the action is taken, and on taking it, the agent transitions into another state. Let's say this state is S2. On doing so, it also receives a reward of negative one.
Now let's calculate the observed Q-value. This is given by the Bellman equation, which defines a recursive relationship between Q-values. For more information, you can check out my video on the topic. But for now, the observed Q-value for state S1 and the action right is the reward for moving into state S2 plus gamma times the maximum future Q-value we can get from state S2. Gamma here is a discount factor that sets how much we value the current reward over future rewards. So let's plug in some values. The reward from S2 is negative one. Let's say that gamma, the discount factor, is 0.1, and we multiply that by the maximum possible Q-value that we can get from state S2. Looking at our Q-table, we can see that we get the maximum Q-value from state S2 by going down, which is 1.5. Substituting 1.5 into our equation, negative 1 plus 0.1 times 1.5, we get an overall Q-value of negative 0.85. This is the observed Q-value for state S1 and going right. But the value in the table for this specific state and action is one.
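The observed value from this step, reward plus gamma times the best Q-value in the next state, can be checked with a short sketch. The function name `observed_q` is made up for illustration, and of the S2 Q-values below only the 1.5 for going down comes from the walkthrough; the other two are placeholder assumptions.

```python
def observed_q(reward, gamma, next_q_values):
    """Reward plus the discounted best Q-value available in the next state."""
    return reward + gamma * max(next_q_values.values())


# Q-values for S2. Only "down" = 1.5 is from the walkthrough;
# "left" and "right" are hypothetical placeholders below the maximum.
q_s2 = {"left": 0.2, "right": 0.7, "down": 1.5}

target = observed_q(reward=-1, gamma=0.1, next_q_values=q_s2)
print(round(target, 2))  # -0.85
```

That gives negative 0.85 as the observed value, against the 1 currently stored in the table.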
There's clearly a difference, and this difference is an error known as the temporal difference error. It's called the temporal difference error because we are comparing Q-values from two different time steps, and their difference is the error, hence the name. For more information on this topic, I created another video right here. But for now, the temporal difference error is the observed value minus the expected value, negative 0.85 minus 1, which turns out to be negative 1.85.
We now update the Q-value in the table using a formula that looks like a gradient update rule. Here, alpha is the step size, or learning rate, so to speak. It defines how much we are willing to change these Q-values at every time step: the higher the value, the faster the learning, because the updates are bigger. In this case, let's also take it to be 0.1. Plugging in these values, we get one, which is the expected value, plus 0.1 times the error, which is negative 1.85. Doing the math, you get 0.815 as the final result. So we update this single Q-value in the table from 1 to 0.815. And this is the end of our first time step.
Let's now walk through another time step, just to make sure we understand what's going on. Say we're now in state S2 at time step two, and we can take the action of going right, left, or down. The policy we are following is random, so let's say we choose to go down. We take the action, and on taking it, the agent transitions into another state. Let's say it's S6, and it receives a reward of negative 1. Now let's calculate the observed Q-value given by the Bellman equation. The observed Q-value is the reward for transitioning into state S6 plus the discounted maximum Q-value that we can obtain from state S6. And we know that the reward for state S6 is negative 1.
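Pausing the second step for a moment, the error and update from the first time step can be checked with a small sketch. The function name `td_update` is an assumption for illustration; the numbers are the ones from the first time step.

```python
def td_update(q_expected, q_observed, alpha):
    """Return (td_error, new_q) for a single Q-value update."""
    td_error = q_observed - q_expected       # observed minus expected
    new_q = q_expected + alpha * td_error    # move a fraction alpha toward the observed value
    return td_error, new_q


# First time step: table value 1, observed value -0.85, learning rate 0.1.
td_error, new_q = td_update(q_expected=1.0, q_observed=-0.85, alpha=0.1)
print(round(td_error, 2), round(new_q, 3))  # -1.85 0.815
```

With alpha at 0.1, the table value only moves a tenth of the way toward the observed value, which is why it lands at 0.815 rather than jumping to negative 0.85.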
From our table, when we are in state S6, the maximum possible Q-value is 1.7. This is the value if we choose to go downward from S6. So we substitute that into the equation, negative 1 plus 0.1 times 1.7, and we get negative 0.83. This is the observed value, but the expected value stored in the table for this cell is 1.5. So the temporal difference error is the observed minus the expected: negative 0.83 minus 1.5, which gives negative 2.33. Now we plug this value into our update rule. This gives us 1.5, the current expected Q-value, plus 0.1 times the temporal difference error of negative 2.33. Doing the math, you get 1.267. So we update this single Q-value in our table, and this is the end of our second time step.
We repeat this until the sequence of steps is over, that is, until we get to the plus 10 spot or the negative 10 spot. This whole sequence of steps is known as one episode. We can then run multiple episodes, over and over again, choosing random actions and learning the values of the Q-table until they become stable. At that point, these Q-values are effectively learned. Once they are learned, an agent can act by choosing, in each state, the action with the highest Q-value. So the Q-values dictate the policy for getting our reward. This new policy, remember, is the target policy that we were trying to achieve. Note that it is different from the behavior policy that we used to explore the environment in order to learn the target policy. Because we can decouple the behavior policy, used for collecting data, from the target policy, used once we have an optimal Q-table, Q-learning is known as an off-policy algorithm.
So I hope Q-learning makes sense, at least at this surface level. That's all I have for today. Thank you all so much for watching.
And if you like the video and you think I deserve it, please give this video a like and subscribe for more amazing content. We're at 100,000 subscribers. We'd love to get to 150,000 subscribers real soon, but thank you all so much. And I will see you in another one. Bye bye.
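For anyone who wants to try the whole procedure from this episode, here's a minimal sketch of Q-learning with a random behavior policy and a greedy target policy. The grid layout is an assumption, not the one from the video: cells are numbered 0 to 8 row by row, the agent starts at cell 0, cell 8 is the plus 10 goal, and cell 5 is the minus 10 poison. Starting the table at zero is one arbitrary choice, and all function names are made up for illustration. Alpha and gamma are both 0.1, as in the walkthrough.

```python
import random

# Hypothetical 3x3 layout (an assumption, not the video's exact grid):
# cells 0..8 row by row, cell 8 is the +10 goal, cell 5 is the -10 poison.
SIZE, GOAL, POISON = 3, 8, 5
MOVES = {"up": -SIZE, "down": SIZE, "left": -1, "right": 1}


def valid_actions(state):
    """Moves that keep the agent on the grid."""
    row, col = divmod(state, SIZE)
    allowed = {"up": row > 0, "down": row < SIZE - 1,
               "left": col > 0, "right": col < SIZE - 1}
    return [a for a, ok in allowed.items() if ok]


def step(state, action):
    """Apply a move and return (next_state, reward, done)."""
    nxt = state + MOVES[action]
    if nxt == GOAL:
        return nxt, 10, True
    if nxt == POISON:
        return nxt, -10, True
    return nxt, -1, False


def train(episodes=2000, alpha=0.1, gamma=0.1, seed=0):
    """Learn a Q-table by acting randomly and applying the TD update each step."""
    rng = random.Random(seed)
    q = {s: {a: 0.0 for a in valid_actions(s)} for s in range(SIZE * SIZE)}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(valid_actions(state))  # behavior policy: random
            nxt, reward, done = step(state, action)
            # No future value once the episode has ended.
            best_next = 0.0 if done else max(q[nxt].values())
            td_error = reward + gamma * best_next - q[state][action]
            q[state][action] += alpha * td_error
            state = nxt
    return q


def greedy_path(q, start=0, max_steps=20):
    """Follow the target policy: always take the highest-valued action."""
    path, state = [start], start
    for _ in range(max_steps):
        if state in (GOAL, POISON):
            break
        action = max(q[state], key=q[state].get)
        state, _, _ = step(state, action)
        path.append(state)
    return path


q = train()
print(greedy_path(q))  # sequence of cells visited under the learned policy
```

The training loop never looks at the Q-table to pick its actions, while `greedy_path` does nothing else. That split is the off-policy idea in code form.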