Hello everyone, this is Alice Gao. In this video, I will continue discussing the passive adaptive dynamic programming (ADP) algorithm. Let me show you an example of tracing the execution of the passive ADP algorithm. Here's a 2x2 grid. There are two non-goal states, S11 and S21, and two goal states with rewards plus 1 and minus 1. The agent is given the following: the current policy, the discount factor, the current estimates of the reward function, and the counts. For the counts, assume that we have encountered each state-action pair five times. Out of the five experiences, we traveled in the intended direction three times, to the left of the intended direction once, and to the right of the intended direction once. Our current transition probability estimates are: the probability of traveling in the intended direction is 0.6, and the probability of traveling to the left or to the right of the intended direction is 0.2 each. The current state is S11.

Let's go through the loop once. The current state is S11, and the policy says that the agent should go down. Suppose that the agent reached S21 and received the reward of minus 0.04. The experience is (S11, down, S21, minus 0.04). Let's do some updates. We have observed this reward before, so there's no need to update the reward function. For the counts, N of (S11, down) should be 5 plus 1, which is 6, and N of (S11, down, S21) should now be 3 plus 1, which is 4. The new transition probability estimate is: the probability of S21 given S11 and down is 4 divided by 6, which is approximately 0.667.

With the reward function and the updated transition probabilities, we're ready to calculate the utility values. There are only two non-goal states, so we can write down the Bellman equations and solve them exactly. Here are the two Bellman equations, one for S11 and the other one for S21. Let's look at one of them. We receive an immediate reward first, and then we think about the future. For the future, we multiply our expected utility by the discount factor. The expected utility depends on which direction the agent actually moves, and we don't know that, so we take into account the transition probability and the expected utility of traveling in each direction. With probability 4 over 6, we travel in the intended direction, which is going down according to the current policy, so we reach S21 and get the expected utility there. With probability 1 over 6, we travel to the left of the intended direction and reach the goal state with reward plus 1, so we get the utility plus 1. Finally, with probability of another 1 over 6, we travel to the right of the intended direction; in this case, we bump into the wall and come back to the same state, so we get the utility of the same state, V of S11. Note that this equation only has two variables: V of S11 and V of S21. Given these two equations, you can solve them using your favorite linear algebra technique. I put the answers below so you can verify them.

That's everything on the passive version of the ADP algorithm for reinforcement learning. Let me summarize. After watching this video, you should be able to trace and implement the passive ADP algorithm. Thank you very much for watching. I will see you in the next video. Bye for now.
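For readers who want to reproduce this trace in code, here is a minimal sketch in Python with NumPy. The count update, the transition-probability update, and the S11 row of the linear system follow the steps walked through above; the discount factor value of 0.9 and the entire S21 equation are assumptions made only for illustration, since the video does not state them.

```python
import numpy as np

# A sketch of the single passive-ADP update traced in the video.
# Hypothetical names; gamma = 0.9 and the S21 equation below are assumptions.

gamma = 0.9          # discount factor (value assumed; not stated in the video)
reward = -0.04       # reward observed in each non-goal state

# --- Step 1: update the counts after the experience (S11, down, S21, -0.04) ---
N_sa  = {("S11", "down"): 5}           # N(s, a): times action a was taken in s
N_sas = {("S11", "down", "S21"): 3}    # N(s, a, s'): times s' followed (s, a)

N_sa[("S11", "down")]         += 1     # 5 -> 6
N_sas[("S11", "down", "S21")] += 1     # 3 -> 4

# --- Step 2: re-estimate the transition probability ---
p_intended = N_sas[("S11", "down", "S21")] / N_sa[("S11", "down")]
print("P(S21 | S11, down) =", p_intended)          # 4/6 ~= 0.667

# --- Step 3: solve the two Bellman equations exactly ---
# From the video, with V = [V(S11), V(S21)]:
#   V(S11) = -0.04 + gamma * (4/6 * V(S21) + 1/6 * (+1) + 1/6 * V(S11))
# The video does not spell out the policy action or successors for S21, so the
# second equation below is a placeholder assumption (0.6/0.2/0.2 split, with
# the intended move returning to S11 and the slips reaching the two goals):
#   V(S21) = -0.04 + gamma * (0.6 * V(S11) + 0.2 * (+1) + 0.2 * (-1))
# Rearranged into A @ V = b:
A = np.array([
    [1 - gamma * (1/6), -gamma * (4/6)],   # S11 equation (from the video)
    [-gamma * 0.6,       1.0],             # S21 equation (assumed placeholder)
])
b = np.array([
    reward + gamma * (1/6) * (+1),                 # S11: slip left reaches +1
    reward + gamma * (0.2 * (+1) + 0.2 * (-1)),    # S21: placeholder terms
])
V_s11, V_s21 = np.linalg.solve(A, b)
print("V(S11) =", V_s11, " V(S21) =", V_s21)
```

Because there are only two unknowns, a single call to np.linalg.solve recovers the utilities exactly, mirroring the "solve the Bellman equations exactly" step described in the video; with more states you would build the same linear system with one row per non-goal state.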