Hello everyone. This is Alice Gao. In this video, I will introduce temporal difference learning algorithms, focusing on a key concept called the temporal difference error.

Previously, I introduced the ADP algorithm for reinforcement learning. ADP stands for Adaptive Dynamic Programming. ADP learns the utility values V(s) by using the Bellman equations; you can see the Bellman equations for the V values on the slide. There is a related quantity called the Q values. Q(s, a) gives the agent's expected utility of performing action a in state s. V and Q are closely related, and we can define them recursively in terms of each other.

Since V and Q are closely related, we can write down the Bellman equations in terms of the Q values as well. Q(s, a) is equal to the immediate reward of entering state s, plus the discounted expected utility of performing action a in state s. If we perform action a, then with some probability we will reach a next state s'. In state s', we will perform the optimal action a' based on the Q values, so the term max over a' of Q(s', a') gives the agent's expected utility of performing the best action in state s'.

Since we have the Bellman equations for both V and Q, learning the two kinds of values is equivalent: if we have an algorithm for learning the V values, we can convert it into an equivalent one for learning the Q values. There are pros and cons to learning the V values versus learning the Q values. For example, one advantage of learning the Q values is that we do not need to learn the transition probabilities. I will explain this idea in more detail shortly.

I'm going to introduce two related reinforcement learning algorithms: Q-learning and SARSA (S-A-R-S-A). Both algorithms belong to a class of algorithms called temporal difference learning. Let me give you an example to explain the key idea behind temporal difference learning.

Assume that we have observed one experience: starting from state s1, we received an immediate reward of r1, took the action a, and reached state s2. Based on this observed transition, how should we update the Q value Q(s1, a)?

Let me start with the Bellman equation for this Q value. Q(s1, a) is equal to the immediate reward of entering s1, plus the discounted expected utility of performing action a in state s1. Action a may take us to a next state s', where the agent will perform the best action based on the Q values. Based on the Bellman equation, Q(s1, a) should be calculated by the expression on the right-hand side.

There is one problem with this calculation: we don't have the transition probabilities. To solve this problem, let's make a simplifying assumption. Since this is the only transition we have observed so far, let's assume that this transition always occurs. In other words, let's assume that the probability of s2 given s1 and a is equal to 1.

Given this assumption, we can simplify the right-hand side expression: Q(s1, a) is equal to the immediate reward of entering state s1, plus the discount factor multiplied by the agent's expected utility of taking the best action in state s2. Note that the transition probability disappeared, since we assumed that the transition from s1 to s2 occurs for sure.

The expression we just wrote down is our prediction of the value Q(s1, a) based on the observed transition. If we take this prediction and subtract the current value of Q(s1, a), the difference is called the temporal difference error, or the TD error.
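Since the slide itself is not shown in this transcript, here is a sketch of the equations described above in standard notation. It assumes the common conventions that R(s) is the immediate reward for entering state s, gamma is the discount factor, and P(s' | s, a) is the transition probability; these symbols are assumptions, not taken from the slide.

```latex
% Bellman equations for V and Q, their relationship, and the TD error
% (standard notation; R(s), \gamma, and P(s' \mid s, a) are assumed conventions)
\begin{align}
  V(s)    &= \max_{a} Q(s, a) \\
  Q(s, a) &= R(s) + \gamma \sum_{s'} P(s' \mid s, a) \, \max_{a'} Q(s', a')
\intertext{Assuming $P(s_2 \mid s_1, a) = 1$ for the single observed
transition (from $s_1$, receive $r_1$, take $a$, reach $s_2$), the
prediction and the TD error become:}
  Q(s_1, a) &\approx r_1 + \gamma \max_{a'} Q(s_2, a') \\
  \delta    &= \underbrace{r_1 + \gamma \max_{a'} Q(s_2, a')}_{\text{prediction}} - \; Q(s_1, a)
\end{align}
```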
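To make this concrete, here is a minimal Python sketch of one tabular update driven by this TD error. Because it uses the max over next actions, it is the Q-learning flavor of the update; the learning rate alpha, the gamma value, the action set, and the dictionary representation of Q are illustrative assumptions, not part of the lecture.

```python
from collections import defaultdict

# Tabular Q values; unseen (state, action) pairs default to 0.0.
Q = defaultdict(float)

alpha = 0.1   # learning rate (assumed value, for illustration)
gamma = 0.9   # discount factor (assumed value, for illustration)
actions = ["up", "down", "left", "right"]  # hypothetical action set

def td_update(s1, r1, a, s2):
    """Update Q(s1, a) after the observed transition from the example:
    starting in s1, receive reward r1, take action a, reach s2."""
    # Prediction: immediate reward plus the discounted value of the best
    # action in s2, assuming the observed transition occurs for sure.
    prediction = r1 + gamma * max(Q[(s2, a2)] for a2 in actions)
    # Temporal difference error: prediction minus the current estimate.
    td_error = prediction - Q[(s1, a)]
    # Adjust Q(s1, a) in proportion to the TD error.
    Q[(s1, a)] += alpha * td_error
    return td_error
```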
The key idea in temporal difference learning is to update the Q values in proportion to the temporal difference error.

That's everything for this video. Let me summarize. After watching this video, you should be able to do the following: explain why learning the V values and learning the Q values are equivalent, and derive the expression for the temporal difference error. Thank you very much for watching. I will see you in the next video. Bye for now.