Hi everyone, this is Alice Gao. Welcome to another video on decision making under uncertainty. In this video, I will start talking about a new tool for solving this kind of decision problem. This tool is called the Markov decision process.

In the previous few lectures, I focused on decision problems with a finite number of stages. To model this kind of decision problem, it is sufficient to have a finite number of decision nodes and incorporate them into a decision network. However, in general, we may have to solve ongoing problems. For example, we can have a problem with an infinite time horizon, which means that the process may go on forever. After all, time does not stop for anyone; it goes on forever. Another kind of ongoing problem is one with an indefinite time horizon. In this case, the agent will eventually stop, but it does not know when it will stop. You can imagine a world with some goal states: if the agent reaches a goal state, it stops; if it does not reach any goal state, it continues exploring. So in this case, the agent does not know when it will reach a goal state and stop. For either type of ongoing problem, we don't know how many time steps it will take the agent to reach an end, or whether there is an end at all, so we cannot use a decision network to model this kind of problem.

Because of the difference between a finite-stage problem and an ongoing problem, we will have to think about our utility function differently. In a decision network, we can make multiple decisions in sequence and then calculate our utility at the end. But in an ongoing problem, it does not make sense to consider the utility at the end: there might not be an end for an infinite-horizon problem, and we don't know when we will reach the end for an indefinite-horizon problem. So instead, we will consider a sequence of rewards, one for each time step. This kind of reward may incorporate several things. It could incorporate the cost of taking certain actions in certain states. It could also incorporate any rewards and punishments we receive along the way, as we are going through the process and trying to solve the problem.

Let's look at a graphical representation of a Markov decision process. Because there is an unlimited number of time steps, we cannot show the entire model, so this picture only shows the first few time steps. We start in the state S0, and in this state, we can take some action called A0. This action may cause us to transition to a new state S1, and because of this action and the resulting state transition, we receive a reward R0. You can see that this is basically a decision network; we are using the same notation and the same kinds of nodes: circles for random variables, rectangles for decision nodes, and diamonds for utility nodes. Also, this process keeps going. I am only showing you the states up to S3, but you can imagine it continues, with new actions and new rewards for the future states as well.

So, in order to define a Markov decision process, we need to specify the following components. We need to specify a set of states. We need to specify a set of actions that are available in each state. We need to specify the transition probabilities: given a state and a particular action taken in that state, what is the probability that the new state is a particular one? Similar to how we dealt with the hidden Markov model, we will again assume that the Markov process is stationary, which means the transition probabilities remain the same for every time step. Finally, we need to define a reward function. In the most general form of the model, the reward depends on the state we started in, the action we took, and the state we ended up in after taking the action: S, A, and S'. For some special cases, some of the parameters of the reward function might not matter. For example, we may have a model where it does not matter what the starting state S is; as long as we enter a particular state S', the reward is the same. We will see examples later on.
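To make these four components concrete, here is a minimal Python sketch of an MDP specification. Everything in it, including the toy two-state repair problem, the state and action names, and all of the numbers, is invented for illustration and is not an example from the lecture.

```python
import random

# 1. A set of states (a toy two-state repair problem).
states = ["healthy", "broken"]

# 2. A set of actions available in each state.
actions = {
    "healthy": ["use", "repair"],
    "broken": ["repair"],
}

# 3. Transition probabilities P(s' | s, a). Because the process is
#    assumed to be stationary, this one table is shared by every time step.
transition = {
    ("healthy", "use"):    {"healthy": 0.9, "broken": 0.1},
    ("healthy", "repair"): {"healthy": 1.0, "broken": 0.0},
    ("broken",  "repair"): {"healthy": 0.6, "broken": 0.4},
}

# 4. A reward function R(s, a, s'). In its most general form it may
#    depend on the starting state, the action, and the resulting state.
def reward(s, a, s_prime):
    r = 0.0
    if a == "repair":
        r -= 5.0   # a cost for taking the repair action
    if s_prime == "healthy":
        r += 10.0  # a reward for ending up in the healthy state
    return r

def step(s, a):
    """Sample one time step of the process: a next state and a reward."""
    dist = transition[(s, a)]
    s_prime = random.choices(list(dist), weights=list(dist.values()))[0]
    return s_prime, reward(s, a, s_prime)
```

Notice that this particular reward function looks only at the action and at the resulting state S', never at the starting state S, which is exactly the kind of special case mentioned above.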
Let me stop here for this video. In the next video, I will talk about how to model the reward function for a process that may keep going forever. After watching this video, you should be able to describe what decision problems with an infinite time horizon are and what decision problems with an indefinite time horizon are, describe how we should model the agent's utility in such ongoing decision problems, and describe the components of a Markov decision process. Thank you very much for watching. I will see you in the next video. Bye for now.