Hello everyone, this is Alice Gao. In this video, I will introduce the Q learning algorithm for reinforcement learning.

Let me introduce the Q learning update rule. Q learning is an example of temporal difference learning. The key idea is to update the Q values in proportion to the temporal difference error. Given an experience, a transition from state s to state s' by taking action a, we will update Q(s, a) as follows: take the original Q value and add a portion of the temporal difference error.

    Q(s, a) <- Q(s, a) + alpha * (R(s) + gamma * max over a' of Q(s', a') - Q(s, a))

The temporal difference error is the predicted Q value minus the current Q value. The predicted Q value is the immediate reward of entering state s plus the discounted expected utility of taking the best action in the next state s'. Alpha is a value between 0 and 1. Since we multiply the temporal difference error by alpha, the change in the Q value is only a portion of the temporal difference error.

Let's rearrange the terms and write the update rule in another way. This alternative version might appear more intuitive to you.

    Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (R(s) + gamma * max over a' of Q(s', a'))

In this alternative version, we're changing the Q value to a linear combination of two terms. The first term is the current Q value. The second term is a predicted Q value based on the observed transition. Alpha controls the weights of the two terms. If alpha is large, the predicted value has more weight, and we're potentially making a large change. On the other hand, if alpha is small, the current value has more weight, and we're likely making a small change.

Let's take a look at the passive version of the Q learning algorithm. Recall that in passive reinforcement learning, the agent has a fixed policy, and the goal is to learn the expected utility of following the policy. In this case, our goal is to learn the Q value, which is the agent's expected utility of taking action a in state s. The passive Q learning algorithm is quite similar to the passive ADP algorithm, and you might want to compare them side by side. One major difference is that for Q learning, we do not need to update the counts used to learn the transition probabilities. The other main difference is that the last step updates the Q value using the temporal difference error, rather than updating the V value using the Bellman equations.

I have made a special note about alpha. Alpha is called the learning rate, and it is similar to the learning rate in the gradient descent algorithm. The magnitude of alpha controls the size of each update. In the algorithm, I wrote alpha as a fixed value. In practice, it's better to change alpha as we receive new experiences. Let N(s, a) denote the number of times the agent has taken action a in state s. Roughly speaking, if alpha decreases as N(s, a) increases, then the Q values will converge to the optimal values. One example of such an alpha function is alpha = 10 / (9 + N(s, a)).

Time to look at the active Q learning algorithm. Again, the structure looks quite similar to the ADP algorithm. However, I would argue that active Q learning is much simpler than active ADP for one reason: the learning algorithm doesn't care about the policy that the agent is following. Let's take a closer look. Similar to active ADP, we can decompose active Q learning into two parts. The bottom part is basically a copy of passive Q learning. Given an experience, we will update the reward function and update the Q value using the temporal difference update. However, note that this update does not reference the current policy in any place.
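To make this concrete, here is a minimal Python sketch of the temporal difference update, using the decaying learning rate mentioned earlier. The table names Q and N, the discount factor value, and the function names are my own assumptions for illustration, not from the video.

    from collections import defaultdict

    GAMMA = 0.9              # discount factor (assumed value)

    Q = defaultdict(float)   # Q values, initially 0
    N = defaultdict(int)     # visit counts N(s, a)

    def alpha(n):
        # Decaying learning rate from the video: alpha = 10 / (9 + N(s, a)).
        return 10.0 / (9.0 + n)

    def td_update(s, a, r, s_prime, actions):
        # One Q learning update for the experience (s, a, r, s').
        # Note the max over a': the target assumes the greedy policy in s',
        # no matter which policy the agent actually follows.
        N[(s, a)] += 1
        target = r + GAMMA * max(Q[(s_prime, a2)] for a2 in actions)
        Q[(s, a)] += alpha(N[(s, a)]) * (target - Q[(s, a)])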
In particular, for the next state s', the learning algorithm assumes that the agent is following the greedy policy. That is, the agent chooses the best action based on the Q values and achieves an expected utility equal to the maximum of Q(s', a') over the actions a'. However, the agent may not be following the greedy policy at all. This learning algorithm works regardless of the policy that the agent is following.

The top part of active Q learning is to determine the agent's action given the current state. This part does depend on the current policy that the agent is following. This version uses optimistic Q values to encourage the agent to explore. If the agent hasn't tried a state-action pair at least N_e times, then we assume that its Q value is R+, which is the maximum possible reward we can obtain in any state. Using R+ as the estimated value makes the state-action pair very attractive to the agent. Once the agent has tried the state-action pair at least N_e times, we will use the current Q value instead. I'll include a short code sketch of this exploration step after the summary.

Active Q learning only needs to check convergence in one place. Every time we go through the loop, we will check whether the agent has visited each state-action pair at least N_e times and whether all the Q values have converged. If both conditions are satisfied, we will terminate the algorithm.

That's everything on the passive and active Q learning algorithms. Let me summarize. After watching this video, you should be able to do the following: describe the Q learning update rule; trace and implement the passive Q learning algorithm; trace and implement the active Q learning algorithm; and explain how Q learning does not care about the policy that the agent is following.

Thank you very much for watching. I will see you in the next video. Bye for now.
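As promised, here is a minimal sketch of the optimistic action-selection step, in the same style as the earlier sketch. The specific values of R_PLUS and N_E, and all of the names, are assumptions for illustration, not from the video.

    from collections import defaultdict

    Q = defaultdict(float)   # Q values, as in the earlier sketch
    N = defaultdict(int)     # visit counts N(s, a)

    R_PLUS = 2.0   # R+: at least the maximum possible reward of any state (assumed value)
    N_E = 5        # N_e: required number of tries per state-action pair (assumed value)

    def f(q, n):
        # Optimistic exploration function: treat rarely tried pairs as worth R+.
        return R_PLUS if n < N_E else q

    def choose_action(s, actions):
        # Pick the action with the highest optimistic value f(Q(s, a), N(s, a)).
        return max(actions, key=lambda a: f(Q[(s, a)], N[(s, a)]))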