Hi everyone, this is Alice Gao. In this video, I'm going to talk about policies for a Markov decision process. When we're solving a Markov decision process, we are looking for a policy, which tells the agent what to do. Now, it is not trivial to figure out what a policy should look like for a Markov decision process, and that's what we're going to work out in this video.

Here's the first idea you might have. You might think that in a Markov decision process, it's sufficient to tell the agent a fixed sequence of actions. Here's an example. Consider the fixed sequence of actions, starting from the start state: go down, go down, go right, go right, and go right. The motivation is that if we start from the start state S11, then down, down, right, right, right should take the robot to the state S34, where it will collect the +1 reward, be happy, and escape this world. I've designed two clicker questions, and we'll use them to think about what happens if we ask the robot to follow a fixed sequence of actions.

For the first question, let's simplify our grid world a little bit. Remember that our grid world has some uncertainty in it: when the robot tries to go in a particular direction, it might not get there; it might veer to its left or its right instead. So let's simplify by assuming that the environment is deterministic. In other words, if the robot tries to go down, it will definitely reach the state below its current state. With that assumption, consider this question: if the environment is deterministic, this fixed action sequence is an optimal solution to the grid world problem. True or false? Think about it yourself, and then keep watching for the answer.

The correct answer is true. In a deterministic world, a fixed sequence of actions is indeed one of the possible optimal solutions, because when we take an action, we don't have to worry about ending up in a place we did not anticipate. In fact, there are other possible optimal solutions; for example, we can go right, right, then down, down, and right. Many sequences of actions will get us to the nice goal state S34 for sure. What you should remember from this question is that a world is really nice to plan in if it's deterministic.

Now let's think about the actual grid world we are in. In the actual grid world, an action may not achieve its intended effect: it has that 80%, 10%, 10% noisy distribution, where the robot moves in the intended direction with probability 0.8 and veers to either side with probability 0.1 each, so it might not end up where it intended. Given this grid world, consider the following question. Take the same fixed sequence of actions: down, down, right, right, and right. True or false: this action sequence could take the robot to more than one square with positive probability. Think about this yourself and then keep watching for the answer.

The answer to this question is true. A fixed sequence of actions could actually take the robot to multiple possible states, each with positive probability.
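Before working through examples by hand, here is a minimal Python sketch that pushes the exact probability distribution over squares through the fixed action sequence. The layout details are my assumptions rather than something from the lecture slides: a 3-by-4 grid with rows numbered 1 to 3 from the top, columns 1 to 4 from the left, no interior walls, and the rule that bumping into a boundary leaves the robot where it is.

```python
# Exact distribution over grid squares after executing a fixed action
# sequence in the noisy grid world. ASSUMED layout: 3 rows by 4 columns,
# rows numbered top to bottom, no interior walls; adjust ROWS, COLS, and
# BLOCKED to match the actual grid from the lecture.

ROWS, COLS = 3, 4
BLOCKED = set()            # e.g. {(2, 2)} if the grid has an interior wall

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

# With probability 0.8 the robot moves in the intended direction; with
# probability 0.1 each, it veers to one of the two perpendicular directions.
PERP = {"up": ("left", "right"), "down": ("left", "right"),
        "left": ("up", "down"), "right": ("up", "down")}

def step(state, direction):
    """Move one square; bumping into an edge or wall means staying put."""
    r, c = state
    dr, dc = MOVES[direction]
    nr, nc = r + dr, c + dc
    if 1 <= nr <= ROWS and 1 <= nc <= COLS and (nr, nc) not in BLOCKED:
        return (nr, nc)
    return state

def run(actions, start=(1, 1)):
    """Push the state distribution through the whole action sequence."""
    dist = {start: 1.0}
    for a in actions:
        new_dist = {}
        outcomes = [(a, 0.8), (PERP[a][0], 0.1), (PERP[a][1], 0.1)]
        for state, p in dist.items():
            for direction, q in outcomes:
                s2 = step(state, direction)
                new_dist[s2] = new_dist.get(s2, 0.0) + p * q
        dist = new_dist
    return dist

for (r, c), p in sorted(run(["down", "down", "right", "right", "right"]).items()):
    print(f"S{r}{c}: {p:.5f}")
```

Under these assumptions, the printout shows positive probability on many different squares, including S11 and S14, which matches the two examples below.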
Let's look at some examples. I've written down two examples for you, but you should be able to come up with lots of examples on your own.

The first example is a bit extreme. If we start from the initial state S11, after all five actions we might end up back in the same state. How is this possible? Well, when we try to go down, we could in fact veer sideways toward the outer wall, with 10% chance, bump into it, and stay in the same square; that can happen for both down actions. And then, when we try to go right, we might actually veer up, again with 10% chance, bump into the wall, and stay in the same square, for all three right actions. So along this path we end up staying in the same state with probability 0.1 to the power of 5: all five actions veer with 10% chance each, and every veer bumps us into the wall and keeps us in the same square.

Then I came up with another example, where we end up reaching the state S14. We can reach it as follows: when we try to go down, we veer to the right instead, with 10% chance, and move one square over. Do that for both down actions and we end up in S13. Then the three right actions all go in their intended direction, with 80% chance each: the first takes us to S14, and the last two bump into the wall, so we stay there. The probability of all of that is 0.1 × 0.1 for the two veers, times 0.8 × 0.8 × 0.8 for the three right actions, which is 0.1² × 0.8³ = 0.00512. You should be able to come up with lots of other examples. In fact — this is my guess, I haven't verified it — you should be able to come up with one example for every possible state in the grid world. So I conjecture that, with positive probability, this sequence of actions could take the robot to any square in the grid world. The sketch above is one way to check this conjecture.

Now, what was the purpose of this question? Remember that our goal was to figure out what a policy looks like for a Markov decision process. The purpose of this question is to show you that a fixed sequence of actions is not enough as a policy for a Markov decision process. The reason is that a fixed sequence of actions could take us anywhere: we might end up in any state because of the uncertainty in the transition probabilities. Because of this, a policy has to look quite different. In fact, a policy needs to take into account every contingency. Contingency here means that we always have to assume we can end up in any possible state, so the policy has to specify, for every possible state, what we should do: if we end up in a particular state, which action should we take?

To summarize, when we're solving a Markov decision process, we are looking for a policy, which specifies what the agent should do as a function of the current state: for every possible state we could end up in, it tells us which action the agent should take.

There are two types of policies: a policy can be non-stationary or stationary. If the policy is non-stationary, then it's a function of both the state and the time: for each state and for each time step, we need to specify what the agent should do. If the policy is stationary, then it does not depend on the current time step; time is irrelevant. As long as we end up in a particular state, the policy specifies what the agent should do, regardless of the current time step. Because we assume that our MDP is a stationary model, we're going to go with a stationary policy as well.
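To make the distinction concrete, here is a tiny sketch in the same Python setting as before. The state names reuse the assumed grid coordinates, and the specific action choices are purely illustrative, not an optimal policy for this grid world.

```python
# A stationary policy depends only on the state: a lookup table with one
# entry for every state, covering every contingency (illustrative actions).
stationary_policy = {
    (1, 1): "down", (2, 1): "down",
    (3, 1): "right", (3, 2): "right", (3, 3): "right",
    # ... and so on, one entry for EVERY state in the grid
}

# A non-stationary policy depends on the state AND the time step.
def non_stationary_policy(state, t):
    # Purely illustrative: act differently early versus late.
    return "down" if t < 2 else "right"

# At each step the agent consults the policy for whatever state the noise
# actually took it to, instead of blindly replaying a fixed plan.
state, t = (1, 1), 0
action = stationary_policy[state]          # stationary: depends on s only
action = non_stationary_policy(state, t)   # non-stationary: depends on s and t
```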
The concept of a policy might still sound kind of abstract to you, but don't worry: in the next video, I'm going to show you quite a few examples of what a policy might look like for our grid world.

That's everything for this video. After watching it, you should be able to explain why it is not sufficient to specify a fixed sequence of actions as a solution to an MDP, explain what a policy for an MDP should look like, and explain the difference between a stationary and a non-stationary policy. Thank you very much for watching. I will see you in the next video. Bye for now.