Hello everyone and welcome to another episode of Code Emporium, where we're going to talk about temporal difference learning. It's a very important concept in reinforcement learning, so let's get started. Let's start with the definition. Temporal difference learning is a method that value-based reinforcement learning algorithms like Q-learning use to iteratively learn state value functions or state-action value functions. Now, every word in this definition is super strange, so let's back up. Temporal difference learning is a method to learn. Okay, that's a little easier. Now we're going to add some more complexity to it. Temporal difference learning is a method that reinforcement learning algorithms use to learn. So what are reinforcement learning algorithms? Reinforcement learning is one of the three machine learning paradigms that machines use to learn. The first is supervised learning, where we have some data and a label, and we use them to train a model to map the data to the label; these are typically classification and regression problems. Then we have unsupervised learning, where we try to understand patterns within the data; there's data, but no label. Examples are dimensionality reduction and clustering. And then we have the third pillar, which is reinforcement learning. Reinforcement learning is learning what to do, that is, how to map situations to actions so as to maximize a numerical reward signal. So it maps states to actions to maximize a reward. Algorithms that solve problems in this way are known as reinforcement learning algorithms, and here's a snapshot of a few pretty common ones. Coming back to this definition, I hope the "reinforcement learning algorithms" component now makes more sense. So let's peel off another layer. Temporal difference learning is a method that value-based reinforcement learning algorithms use to learn. So what is a value-based reinforcement learning algorithm?
Reinforcement learning algorithms, at the end of the day, are trying to maximize a total reward. This can happen in multiple ways, and so we can subcategorize reinforcement learning algorithms into value-based methods or policy-based methods. Value-based methods will determine a value function, which quantifies the total reward, and using this value function we will determine an optimal policy. Policy-based methods, on the other hand, will determine an optimal policy directly, with no value function. The optimal policy is the policy that maximizes the total reward. And just to note, a policy is just how an agent behaves in a certain situation or state. So given a state, what action will it take? That's what the policy determines. An example of a value-based method is Q-learning, among others. An example of a policy-based method is proximal policy optimization, among others. So now, coming back to our definition, I hope this value-based reinforcement learning term makes a little more sense. Now let's peel off another layer. Temporal difference learning is a method that value-based reinforcement learning algorithms like Q-learning use to learn state value functions or state-action value functions. So what are these state and state-action value functions? Value-based methods determine a value function, which is used in turn to determine the policy that will maximize the total reward. A value function is a function: it has inputs and it gives some output. And depending on the inputs, there can be two types of value functions. We have a state value function, where the input is a state, and we have a state-action value function, where the input is a state and an action. So states and actions, what are they? A state is a snapshot of the environment, and an action is the decision taken by the agent in the environment. Now, state value functions will take a state as input and output a real number. The state-action value function will take a state and an action as input and output a real number.
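To make the two value-function types concrete, here is a minimal sketch in Python. The state and action names are illustrative, not from the video; both functions are just lookup tables from inputs to a real number:

```python
# State value function V(s): maps a state to a real number answering
# "how good is it to be in state s?"
V = {"S1": 0.0, "S2": 0.0, "S3": 0.0}

# State-action value function Q(s, a): maps a (state, action) pair to a
# real number answering "how good is it to take action a in state s?"
Q = {("S1", "right"): 0.0, ("S1", "down"): 0.0, ("S2", "down"): 0.0}

print(V["S1"])             # value of being in state S1
print(Q[("S1", "right")])  # value of taking "right" from state S1
```

The only structural difference is the input: a state alone for V, versus a state-action pair for Q.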
And this real number is known as a Q-value. So the state value V(S) will quantify how good it is to be in this specific state S, whereas a state-action value, or Q-value, will quantify how good it is to be in this state S and take an action A in this state. Coming back to the definition, I hope this state value function and state-action value function piece makes a lot more sense now. The definition is almost completely uncovered, but to uncover the last piece, we'll need to talk about temporal difference learning itself. So let's take a fully observable grid world with nine squares. There's a plus-10 reward square, a negative-10 poison square, and the reward for all the other squares is negative one. The goal of our agent is to get to this plus-10 reward spot in the best possible way. More technically speaking, we want the agent to learn the optimal policy. And let's say that we want to use a value-based reinforcement learning algorithm to do this. Because it's value-based, we need to determine a value function. Let's say that we have a finite number of states and we start by storing the value of every state in a table, initializing every entry to zero. Just to be clear, each of these entries should answer: how good is it to be in this state? Or more technically, what is the total reward that I will see being in this state S1? Similarly, the value of state S2 is how good it is to be in state S2, or the total reward that we will observe when we are currently in state S2. Say that this agent starts with a policy of exploration. This means that whatever state it's in, the action taken will be random. So the agent is now in state S1. It can either go right or go down. But because this is a random policy, let's say that it randomly chose to go right. Now the agent took the action, and because it took an action, it observes a reward and also transitions to another state. The reward here is negative one.
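The setup above can be sketched in a few lines of Python. The grid layout here is an assumption for illustration (which squares hold the +10 and −10 rewards is not fully specified in the walkthrough); the table of zeros and the random exploration policy match the description:

```python
import random

# Nine states S1..S9, with every value initialized to zero.
states = [f"S{i}" for i in range(1, 10)]
V = {s: 0.0 for s in states}

def step_reward(next_state):
    """Reward observed when transitioning into next_state.
    Assumed layout: S9 is the +10 goal, S6 the -10 poison square."""
    if next_state == "S9":
        return 10.0
    if next_state == "S6":
        return -10.0
    return -1.0

# Exploration policy: from S1 the agent can go right or down,
# and it picks one of those actions at random.
actions = ["right", "down"]
action = random.choice(actions)
print(action)
```

Running this repeatedly would pick "right" or "down" at random, which is exactly the exploratory behavior described above.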
And let's say that it transitioned into state S2. From our trusty Bellman equation, more details of which you can see in another video, we know that the value of state S1 is the reward that we receive transitioning into the next state S2, plus the value of state S2. So this would be negative one, plus, well, from our table it looks like zero for the value of S2, because we initialized everything to zero. So we get the observed value of state S1 as negative one. Now, the observed value for state S1 is negative one, but the expected value, which is what's in our table, is zero. The difference between the observed value and the expected value of the state is known as the temporal difference error. It's temporal because it calculates the error between two different time steps: we are computing the difference between the values at these two time steps. Because there's a difference between these two values, we now need to update the value of state S1 in our table, which we can do with this formula. Alpha here is a learning rate; it determines how much we want to change the values in the table, or how fast we want to learn. In this case, if you plug in the values, you'll get the value of state S1 as negative 0.1, which we can update in the table. So this is the first time step, but let's go through another time step just to make sure the process is clear. Now, at the next time step, we are in state S2, and from here we can take multiple possible actions. Let's say that we go down. So we take that action, we end up with the reward, which is negative one, and we end up in a new state. Let's say that this new state we end up in is S7. We then use our trusty Bellman equation to calculate the observed value of state S2: it's the reward of transitioning into S7 plus the value of S7, which is negative one plus zero, so negative one.
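The single TD update just described can be written out directly. Here alpha is assumed to be 0.1, which is consistent with the −0.1 result in the walkthrough (the transcript does not state alpha explicitly):

```python
# One TD(0) update for the transition S1 -> S2 described above.
alpha = 0.1                   # learning rate (assumed value)
V = {"S1": 0.0, "S2": 0.0}    # table initialized to zero

reward = -1.0                 # reward for transitioning into S2
observed = reward + V["S2"]   # Bellman: r + V(next state) = -1 + 0 = -1
td_error = observed - V["S1"] # observed minus expected = -1 - 0 = -1
V["S1"] = V["S1"] + alpha * td_error

print(V["S1"])  # -0.1, matching the value computed in the walkthrough
```

The update nudges the stored estimate a fraction alpha of the way toward the observed value, rather than replacing it outright, which is what keeps learning stable as estimates fluctuate.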
We then calculate the temporal difference error, which is the observed value minus the expected value, still negative one. We then calculate the updated value of state S2 and write it into the table. And then we just repeat this sequence of steps over and over, updating the table values, until we reach the end. That is considered to be one episode. We can then perform multiple such episodes over and over again, adjusting the values in the table until they become stable. Effectively, the value functions are learned. And once they are learned, the agent, instead of taking some random action, can now take actions based on the value functions in this table. So the value functions determine the actions, which means they determine the policy. The optimal policy, or best policy, is then determined by taking the action with the best value given your state. In the textbook linked in the description down below, these are the steps that we just talked about. And so we come back to our definition for the last time. Temporal difference learning is a method that value-based reinforcement learning algorithms like Q-learning use to iteratively learn state value functions or state-action value functions. And now I hope this makes a lot more sense. There are other versions of temporal difference learning that take multiple time steps into account instead of a single time step, but I hope that this video helped you understand the concept itself. Thank you all so much for watching. If you like this video and you think I do deserve it, please do give this video a thumbs up. Also, don't forget to subscribe, hit that bell button for notifications, and I will see you in another one. Bye-bye.
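Putting the episode loop together, here is a minimal end-to-end sketch of TD(0) on the grid world. The grid layout, the terminal squares, and the transition function are illustrative assumptions (the video does not fully specify them), but the update rule inside the loop is exactly the one from the walkthrough:

```python
import random

random.seed(0)
alpha, n_episodes = 0.1, 200

# 3x3 grid, states numbered 1..9 row by row. Assumed layout:
# square 9 is the +10 goal (terminal), square 6 the -10 poison (terminal);
# every other transition costs -1.
terminal = {9: 10.0, 6: -10.0}
V = {s: 0.0 for s in range(1, 10)}

def neighbors(s):
    """Legal moves (right/left/up/down) from square s on the 3x3 grid."""
    r, c = divmod(s - 1, 3)
    moves = []
    if c < 2: moves.append(s + 1)   # right
    if c > 0: moves.append(s - 1)   # left
    if r > 0: moves.append(s - 3)   # up
    if r < 2: moves.append(s + 3)   # down
    return moves

for _ in range(n_episodes):
    s = 1                                    # start each episode in S1
    while s not in terminal:
        s_next = random.choice(neighbors(s)) # exploration: random action
        reward = terminal.get(s_next, -1.0)  # reward for the transition
        observed = reward + V[s_next]        # Bellman estimate
        V[s] += alpha * (observed - V[s])    # TD(0) update
        s = s_next

print({s: round(v, 2) for s, v in V.items()})
```

After many episodes the table stabilizes, and the learned values can then drive the policy: from any state, move toward the neighbor with the best value.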