Hi everyone, this is Alice Gao. In the previous video, I talked about the motivation for using a Markov decision process to model an ongoing decision problem, and then I described the components of a Markov decision process. In this video, I will talk about how we should choose the reward function for a Markov decision process.

Recall that for a finite stage problem, it is straightforward to model an agent's utility. There is a finite number of decisions to make, so there is a finite number of combinations of values for these decisions. We can write down a giant table, list all of these combinations, and for each combination write down the agent's utility. So that's doable. For an ongoing problem, it's less clear how to do this, because we have a potentially infinite number of time steps. So how should we model the agent's reward or utility over time?

Let's look at the three possible reward functions on this slide: total reward, average reward, and discounted reward. For all of these, let's simplify the notation a little bit and assume that the reward depends only on the state being entered, so r(s) is the reward for entering state s. With that, let me write down the expressions for the total reward, the average reward, and the discounted reward. The total reward simply takes the reward we get at each time step and adds them all together. The average reward again computes the total reward, except that we divide it by the total number of time steps; since this is an ongoing process, the number of time steps might be infinite, so we take the limit as the number of time steps approaches infinity. The discounted reward might be a new concept for you, so I will explain it after discussing the total and average reward.

There are problems with the first two approaches. What's the problem with using the total reward? This one is relatively easy to see. If we have an unlimited number of time steps, then the sum of all the rewards is likely to be infinite. If we can get an infinite amount of reward from a process, how do we compare which set of actions is best? After all, our goal is to solve this process and derive the optimal policy. If we have two possible policies and the total reward for both of them is infinite, then there is no way for us to tell which one is better.

The average reward function has a similar problem. If the total reward is finite but we average it over a potentially infinite number of time steps, then the average reward is going to be zero. In that case, if we have two policies that both result in a finite total reward, the average reward of both policies is zero, and again we cannot compare them.

Since the first two approaches both have problems, you might have guessed that we are going to go with the third one: modeling the reward with a discounted reward function. You might not have heard of this concept before, but the discounted reward is widely used in microeconomics and, in particular, in game theory.
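Written out, the three candidate reward functions look roughly like this (a sketch: the labels $V_{\text{total}}$, $V_{\text{avg}}$, $V_{\text{disc}}$ and the time index $t$ are notation added here for readability, not taken from the slide, and $\gamma$ is the discount factor explained next):

$$V_{\text{total}} = \sum_{t=0}^{\infty} r(s_t)$$

$$V_{\text{avg}} = \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} r(s_t)$$

$$V_{\text{disc}} = \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t) = r(s_0) + \gamma\, r(s_1) + \gamma^{2}\, r(s_2) + \cdots, \qquad 0 \le \gamma < 1$$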
The idea is that the reward at the first time step counts at its full value, r(s_0). For the next time step, we take the reward and multiply it by a discount factor, giving γ · r(s_1). For the time step after that, we multiply the reward by the discount factor squared, giving γ² · r(s_2). At every new time step, we take the factor from the previous time step and multiply it once more by the same discount factor. The discount factor is written γ, and it is a number that is greater than or equal to zero and strictly less than one.

Now, a large part of microeconomics and game theory is about modeling people's real-life decision processes using mathematics, and there are good reasons for modeling reward in this discounted way. Let me give you two.

The first reason is that if we are going to receive a fixed amount of reward, we prefer receiving it today rather than tomorrow, and tomorrow rather than the day after. In other words, we value the same amount of reward more when it is received sooner rather than later. There is evidence for this from psychological experiments: researchers have run many experiments with real people and found that people really do prefer to receive reward sooner rather than later. So one reason for using a discount factor is that the same reward is worth more to me if I receive it today than if I receive it tomorrow; if I receive it tomorrow, I multiply it by the discount factor to reflect that it is worth less to me when received tomorrow.

The second reason might sound a little depressing, but it actually makes sense: every day, there is a chance that tomorrow will not come. We use the discount factor to model the fact that there is some positive probability that tomorrow will not happen. For example, if the discount factor is 90%, we are saying that with 10% probability the world ends tomorrow; only with 90% probability does tomorrow actually arrive and we receive the reward we are supposed to get. I hope you don't take such a pessimistic view of life, but apparently economists do.

Another advantage of the discounted reward function is that we can prove that, with a discount factor, the total discounted reward over a potentially infinite number of time steps is finite (a short calculation below makes this precise). That is a nice property: if we are comparing two policies, the discounted reward function gives us two finite values, so we can compare them and decide which policy is better. Because of all of these reasons, we are going to model our MDP using a discounted reward function.

That's everything for this video. After watching it, you should be able to explain what the total reward, average reward, and discounted reward functions are, and why we chose the discounted reward function rather than the total reward or the average reward function. Thank you very much for watching. I will see you in the next video. Bye for now.
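The short calculation behind the finiteness claim above, as a minimal sketch: assume every reward is bounded in magnitude by some constant $R_{\max}$ (this boundedness assumption is added here and is not stated explicitly in the video). Then, since $0 \le \gamma < 1$, the discounted reward is bounded by a geometric series:

$$\left|\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t)\right| \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max} \;=\; \frac{R_{\max}}{1-\gamma} \;<\; \infty.$$

So the discounted reward of any policy is a finite number, and two policies can always be compared.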