Hello everyone, this is Alice Gao. In this video, I will discuss the active version of the Adaptive Dynamic Programming, or ADP, algorithm. Previously, I discussed the passive ADP agent. That agent follows a fixed policy and learns the expected value of following that policy. In the real world, however, the agent is not limited to a fixed policy; it has complete freedom to follow any policy. The question is, what should the agent do? What action should the agent take at each time step?

What is the effect of taking an action? In our grid world, taking an action serves two purposes. One, an action can provide rewards; this is a short-term benefit to the agent. Two, an action can help us gather more data to learn a better model; this is a potential long-term benefit. Based on these two purposes, there are two useful things the agent can do. The first option is to exploit. This may be what you think of by default: if the agent has learned a model consisting of the transition probabilities and the utility values, it should take the optimal action based on this model. The alternative option is to explore. The agent may want to take an action that is different from the optimal one. Why would taking a suboptimal action be a good idea? For example, this action may take the agent to a state it has not visited before, and the agent may discover actions that are better than the best one it has found so far.

If an agent chooses to exploit all the time, we call it a greedy agent. Experiments have shown that being greedy all the time is a terrible idea: the greedy agent seldom converges to the optimal policy and sometimes converges to horrible policies. Why is being greedy a bad idea? The main reason is that, at any point, we have limited observations of the environment. A model learned from these limited observations is never the same as the true environment, so behaving optimally with respect to the learned model is often very different from behaving optimally in the true environment. Because of this, the best strategy for the agent is to perform exploration and exploitation at the same time, maintaining a balance between the two. This trade-off has been studied in depth in the subfield of statistical decision theory that deals with a class of problems called multi-armed bandit problems.

What is the best way to maintain the trade-off between exploration and exploitation? Let's look at a few strategies.

Strategy one: take a random action (explore) an epsilon fraction of the time, and take the best action (exploit) the remaining one-minus-epsilon fraction of the time. Epsilon is a parameter that we need to set. In practice, we may want to start with a relatively large epsilon and decrease it over time, so that we explore more at the beginning and explore less as time goes on.

Strategy two: one problem with strategy one is that it treats all the suboptimal actions in the same way, yet some suboptimal actions may be better than others. One way to improve on strategy one is to select each action with a probability that increases with the expected utility of taking that action. This leads to the idea of softmax selection, which we can implement using the Gibbs, or Boltzmann, distribution. Take a look at the formula: the probability of taking an action a is proportional to e raised to Q(s, a) over T, where Q(s, a) is the expected utility of taking action a in state s. T is a parameter called the temperature, and we can use it to adjust the shape of the distribution. When T is high, the distribution is close to a uniform distribution, and we choose every action with similar probabilities. When T is low, the distribution does a better job of distinguishing actions with different expected utilities, so we are more likely to choose an action with a higher expected utility. In the limit, as T approaches zero, the distribution approaches a point mass and we choose the best action with probability one.
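As a concrete illustration (my own sketch, not code from the video), here is a minimal Python version of these two selection rules. It assumes we already have a table q of Q(s, a) estimates and a list of the available actions; both names are placeholders for whatever representation your agent uses.

```python
import math
import random

def epsilon_greedy(q, state, actions, epsilon):
    """Strategy 1: explore a random action with probability epsilon,
    otherwise exploit the best action under the current Q estimates."""
    if random.random() < epsilon:
        return random.choice(actions)                      # explore
    return max(actions, key=lambda a: q[(state, a)])       # exploit

def softmax_action(q, state, actions, temperature):
    """Strategy 2: Gibbs/Boltzmann selection, P(a) proportional to exp(Q(s, a) / T).
    A high temperature gives near-uniform choices; a low one favours the best action."""
    # Subtract the largest Q value before exponentiating, for numerical stability.
    best = max(q[(state, a)] for a in actions)
    weights = [math.exp((q[(state, a)] - best) / temperature) for a in actions]
    return random.choices(actions, weights=weights)[0]
```

In practice, the epsilon or temperature parameter would be decayed over time, so the agent explores heavily at first and gradually shifts toward exploitation.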
Strategy three: instead of changing the probability of choosing an action, we can encourage the agent to explore by changing the utility estimates themselves. This is the third idea: using optimistic utility estimates to encourage exploration. Let me describe this strategy in more detail.

The agent is more likely to try an action that leads to a higher expected utility. To encourage the agent to explore, we can set all the utility estimates to be large in the beginning; these are called optimistic utility estimates. Once the agent has tried a state-action pair a certain number of times, we revert to computing the utility estimates from our observations. To do this, we modify the Bellman updates. Let U+ denote the optimistic utility estimates. In the original Bellman update, the quantity after the max over a is the expected utility of taking action a in state s: the sum, over successor states, of the transition probability multiplied by the utility value. In the modified Bellman update, we instead pass this quantity through a function f, called an exploration function, which helps us trade off exploration and exploitation. f takes two arguments: the first is the actual expected utility of the action given our observations so far, and the second is the number of times we have visited the state-action pair. If we have not visited the state-action pair at least N_e times, then f assumes that the expected utility of taking the action is R+, an optimistic estimate of the best possible reward obtainable in any state. R+ is generally a large value, which makes the state-action pair very attractive to the agent. Once the agent has tried the state-action pair at least N_e times, f returns the actual expected utility computed from our observations. In short, the optimistic utility estimates encourage the agent to try every state-action pair at least N_e times. For this strategy to work, the exploration function f should be increasing in the utility estimate u and decreasing in the visit count n.

Let's look at the steps of the active ADP algorithm. The active ADP algorithm combines the passive ADP algorithm with an exploration strategy. For example, let's combine the passive ADP algorithm with the third exploration strategy, which uses the optimistic utility estimates. We start by initializing some values arbitrarily: the reward function, the optimistic utility estimates, and the counts for estimating the transition probabilities. Next, we go inside the loop, and we keep going through the loop until the optimistic utility estimates converge. Inside the loop, the algorithm alternates between two tasks: one, take the optimal action based on the current utility estimates; two, update the utility estimates using the new experience. For steps three and four, we take the current optimistic utility estimates and determine the optimal action for the current state; the agent takes this action and generates an experience. For steps five to seven, given the experience, we update a few things. First, we update the reward function if we have not observed the reward before. Second, we update the counts for the state-action pair, which are used to calculate the transition probabilities. Finally, we update the optimistic utility estimates iteratively using the Bellman updates until they converge. Note that we are still using the optimistic utility estimates when performing value iteration or policy iteration. In this algorithm, we check convergence in two places. One, we check convergence during value iteration or policy iteration: we perform value or policy iteration until the utility estimates converge. Two, we check convergence across consecutive iterations of the outer loop: after each iteration, we check how much the utility values differ from those in the previous iteration and whether they have converged.
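To tie the pieces together, here is a rough Python sketch of an active ADP agent built around this exploration function. It is my own illustration rather than code from the course: the constants R_PLUS, N_E, and GAMMA are placeholder values, and it assumes rewards are attached to states (as in the grid world), with each percept supplying the reward of the newly reached state.

```python
from collections import defaultdict

# Placeholder constants for this sketch; in practice they depend on the problem.
R_PLUS = 2.0   # optimistic estimate of the best reward obtainable in any state
N_E = 5        # try each state-action pair at least this many times
GAMMA = 0.9    # discount factor

def exploration_f(u, n):
    """Exploration function f(u, n): return the optimistic value R+ until the
    pair has been tried N_e times, then the actual expected utility u."""
    return R_PLUS if n < N_E else u

class ActiveADPAgent:
    def __init__(self, states, actions):
        self.states = states                 # list of all states
        self.actions = actions               # list of all actions
        self.R = {}                          # learned reward function R(s)
        self.U = defaultdict(float)          # optimistic utility estimates U+(s)
        self.Nsa = defaultdict(int)          # visit counts N(s, a)
        self.Nsas = defaultdict(int)         # transition counts N(s, a, s')

    def transition_prob(self, s, a, s2):
        n = self.Nsa[(s, a)]
        return self.Nsas[(s, a, s2)] / n if n else 0.0

    def expected_utility(self, s, a):
        """Sum over successor states of P(s' | s, a) * U+(s')."""
        return sum(self.transition_prob(s, a, s2) * self.U[s2] for s2 in self.states)

    def update_utilities(self, tolerance=1e-4):
        """Repeated modified Bellman updates on U+ using the exploration function."""
        while True:
            delta = 0.0
            for s in self.states:
                if s not in self.R:          # no reward observed for s yet
                    continue
                best = max(exploration_f(self.expected_utility(s, a), self.Nsa[(s, a)])
                           for a in self.actions)
                new_u = self.R[s] + GAMMA * best
                delta = max(delta, abs(new_u - self.U[s]))
                self.U[s] = new_u
            if delta < tolerance:            # inner convergence check
                return

    def choose_action(self, s):
        """Pick the action that looks best under the optimistic estimates."""
        return max(self.actions,
                   key=lambda a: exploration_f(self.expected_utility(s, a),
                                               self.Nsa[(s, a)]))

    def observe(self, s, a, s2, reward2):
        """Record one experience: action a taken in s led to s2 with reward2."""
        self.R.setdefault(s2, reward2)       # record the reward of the new state
        self.Nsa[(s, a)] += 1
        self.Nsas[(s, a, s2)] += 1
        self.update_utilities()              # refresh U+ with the updated model
```

A driver loop would then repeatedly call choose_action for the current state, execute that action in the environment, and pass the resulting percept back through observe, stopping once the utility estimates change very little between consecutive iterations of the outer loop.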
That's everything on the active version of the ADP algorithm for reinforcement learning. Let me summarize. After watching this video, you should be able to do the following: explain why we want to balance exploration and exploitation, describe some exploration strategies, and describe the steps of the active ADP algorithm. Thank you very much for watching. I will see you in the next video. Bye for now.