Hello everyone! This is Alice Gao. Welcome to the first video on reinforcement learning algorithms. In this video, I will introduce reinforcement learning and then describe our first reinforcement learning algorithm, called adaptive dynamic programming. I'll focus on the passive version of the ADP algorithm.

Let's look at the basic setting of a reinforcement learning problem. Recall that there are three types of problems in machine learning. In supervised learning, we have a label for every example. In unsupervised learning, we have no label for any example. Reinforcement learning is somewhere between the two. In a reinforcement learning problem, we're given some numeric feedback, called rewards or punishments, once in a while, and we want to determine what to do at each time step given this numerical feedback.

Let's consider a fully observable, single-agent reinforcement learning problem. We'll model this problem as a Markov decision process. At the beginning, the agent is given the set of states and the set of possible actions. At each time step, the agent observes the state and the reward, since the environment is fully observable. After observing the state and reward, the agent carries out an action. The goal of the agent is to maximize its total discounted reward over time.

Reinforcement learning is challenging for several reasons. First, the agent receives rewards infrequently. It is often difficult to determine which action or which sequence of actions was responsible for a reward. For example, playing a chess game requires a lot of actions, but the agent only gets one reward at the end: win or lose. Second, an action may have long-term effects on the agent's utility. Carrying out a seemingly bad action at the beginning may allow the agent to receive large rewards in the future. Choosing an action is particularly challenging since the agent does not know the effects of the action beforehand. Third, at any time step, should the agent explore or exploit? If the agent always exploits, it may not discover better actions. On the other hand, if the agent always explores, it never makes use of the learned knowledge to maximize its utility. The agent needs to carefully balance exploration and exploitation.

Before I tackle the full reinforcement learning problem, let's consider a simpler setting: a passive learning problem. In this setting, the agent follows a fixed policy. Its goal is to learn the expected value of following this policy, that is, to learn the utility value V(s) for every state s. The passive reinforcement learning problem is similar to the policy evaluation step in the policy iteration algorithm: given a policy, we want to learn the utility values V. However, this problem is more difficult for two reasons: the agent does not know the transition probabilities, nor does it know the reward function. Fortunately, we can tackle this problem using the same approach as policy evaluation, solving for the utility values using the Bellman equations. To do this, we must learn the transition probabilities and the reward function. We can learn both of these using the observed transitions and rewards as we navigate the world. In other words, we'll take actions one at a time based on the policy. Each action will allow us to observe the reward and the transition. We will use these observations to update our estimates of the transition probabilities and the reward function. This algorithm is called the passive adaptive dynamic programming algorithm, or the passive ADP algorithm.
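For reference, here is the form of the Bellman equations that the passive ADP agent will eventually solve for its fixed policy pi. The notation below (V^pi, R, P, gamma) is standard MDP shorthand rather than a formula copied from the lecture slides:

```latex
% Bellman equation for evaluating a fixed policy \pi:
% each state's utility is its reward plus the discounted expected
% utility of the successor state reached by following \pi.
V^{\pi}(s) = R(s) + \gamma \sum_{s'} P\left(s' \mid s, \pi(s)\right)\, V^{\pi}(s')
```

Because pi is fixed, there is no max over actions, so these equations are linear in the unknowns V^pi(s).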
ADP is a model-based algorithm because it requires us to learn a model of the world. The model consists of the transition probabilities and the reward function.

Let's take a look at the passive ADP algorithm. There's a loop from step two to step five. The agent will go through this loop whenever it takes an action and generates a new experience.

Step two: generate an experience. Suppose that the agent is in state s right now. The agent takes an action a based on the policy pi. This causes the agent to transition to state s' and receive a reward r'. The experience consists of s, a, s', and r'. The agent can use this experience to update its estimates of the utility values.

Step three: update the reward function. Entering state s' gives us a reward of r'. If we haven't observed this reward before, we will record it in our reward function.

Step four: update the transition probabilities. The experience contains one state transition: taking action a in state s causes the agent to transition to state s'. We can use this to update our transition probabilities. To calculate the transition probabilities, we keep track of two counts. N(s, a) records the number of times the agent takes action a in state s. N(s, a, s') records the number of times the agent takes action a in state s and lands in state s'. The probability of s' given s and a is equal to N(s, a, s') divided by N(s, a). We will increment the two counts and calculate the updated transition probability.

Step five: update our estimates of the utility values. After updating the reward function and the transition probabilities, we are ready to estimate the utility values. Looking at the Bellman equations, the reward function is known, the discount factor is given, and the transition probabilities are known. The V values, or utility values, are the variables. We can solve for the utility values exactly, since all the Bellman equations are linear; you can use your favorite technique for solving a system of linear equations. Alternatively, we can solve for the utility values iteratively, using value iteration or policy iteration.

That's everything on the passive ADP algorithm for reinforcement learning. Let me summarize. After watching this video, you should be able to do the following: describe the basic setting of a reinforcement learning problem, and describe the steps of the passive ADP algorithm. Thank you very much for watching. I will see you in the next video. Bye for now.
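To make the loop from step two to step five concrete, here is a minimal sketch of a passive ADP agent in Python. This is not code from the lecture: the class name, the dictionary-based tables for R, N(s, a), N(s, a, s'), and P, and the choice to solve the Bellman equations by repeated iterative sweeps are all illustrative assumptions.

```python
from collections import defaultdict

class PassiveADP:
    """Sketch of passive ADP: learn R and P from experience, then estimate V for a fixed policy."""

    def __init__(self, policy, gamma=0.9):
        self.policy = policy              # fixed policy: dict mapping state -> action
        self.gamma = gamma                # discount factor
        self.R = {}                       # learned reward function: state -> reward
        self.N_sa = defaultdict(int)      # N(s, a): number of times action a was taken in state s
        self.N_sas = defaultdict(int)     # N(s, a, s'): number of observed transitions s, a -> s'
        self.P = defaultdict(dict)        # P[(s, a)][s'] = estimated transition probability
        self.V = defaultdict(float)       # utility estimates, default 0 for unseen states

    def observe(self, s, a, s_next, r_next):
        """Steps two to four: record one experience (s, a, s', r') and update the model."""
        # Step three: record the reward for entering s' if we haven't seen it before.
        self.R.setdefault(s_next, r_next)
        # Step four: increment the counts and recompute P(. | s, a) from them.
        self.N_sa[(s, a)] += 1
        self.N_sas[(s, a, s_next)] += 1
        for (s2, a2, s2_next), count in self.N_sas.items():
            if (s2, a2) == (s, a):
                self.P[(s, a)][s2_next] = count / self.N_sa[(s, a)]
        # Step five: re-estimate the utilities under the updated model.
        self.evaluate_policy()

    def evaluate_policy(self, sweeps=50):
        """Solve the (linear) Bellman equations approximately by repeated sweeps."""
        for _ in range(sweeps):
            for s in list(self.R):
                a = self.policy.get(s)
                expected = sum(p * self.V[s2] for s2, p in self.P.get((s, a), {}).items())
                self.V[s] = self.R[s] + self.gamma * expected
```

In a real implementation you might instead assemble the known rewards and transition probabilities into a matrix and solve the linear system directly, but the iterative sweeps above mirror the "solve the Bellman equations" idea described in the lecture.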