Hi everyone, this is Alice Gao. In this video, I'll start talking about the 3x4 grid world. I'm going to use this as the running example throughout this unit on Markov decision processes. The robot is trying to explore this grid world. The grid world has three rows and four columns; the first row and the first column in this picture are simply indices for the rows and columns. Let's use a coordinate system: s_ij denotes the state, or position, in row i and column j. So the top left corner is s11 and the bottom right corner is s34. s11 is the initial state; I've labeled it in the picture as the start. We have a wall at s22, represented by the X at that position. Having a wall means that the robot cannot get into that square. There are two goal states, s24 and s34. If the robot gets to either goal state, it escapes the world, so think of a goal state as an absorbing state: if the robot gets there, that's the end of the process. The reward at s24 is minus one and the reward at s34 is plus one. These are the only two goal states in the entire world. This world is fairly simple: the robot explores a lot, and ideally it's trying to get to the state s34 while avoiding the state s24.

There are a few additional components of the problem that I need to describe to you: the actions, the transition probabilities, and the reward function. For each state, there are four possible actions: the robot can go up, down, left, and right. In this example, let's assume that up, down, left, and right are from our perspective, not from the robot's perspective. From the robot's perspective, it would only be going in whatever direction it is facing right now, or to its left, to its right, or behind it. We are going to assume that every action is possible in every state. Now you might ask: what if the robot bumps into a wall? We're going to take care of that when we define our transition probabilities.

Let's talk about our transition model. The transition model, in general, is a conditional probability distribution. It says: starting from a particular state s, if we take an action a, what is the probability that the robot reaches the state s'? s' may be different from s, or it may be the same as s. Remember that we have a stationary process. That means we only need to define one conditional probability distribution, and this distribution works for all of the states. This grid world the robot is in is full of uncertainty, and one kind of uncertainty is that an action may not achieve its intended effect. So let's take a look. In this world, each of the four actions, up, down, left, or right, achieves its intended effect with probability 80%. That means if the robot tries to go to its left, for example, it successfully gets to its left with only an 80% chance. What about the remaining 20%? With a 10% chance, the robot is going to turn to its left, making a 90 degree left turn, and with another 10% chance, the robot is going to turn to its right, making a 90 degree right turn. In a lot of scenarios, the robot might bump into a wall. In that case, the robot stays in the same square: it bumps into the wall, comes back, and remains in the same square.
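To make this concrete, here is a minimal Python sketch of the grid and the 80/10/10 dynamics just described. The tuple representation of states and the names WALL, TERMINALS, MOVES, SLIPS, move, and transition_model are my own choices for illustration, not anything defined in the video; only the probabilities and the bump-back rule come from the description above.

```python
# A minimal sketch of the 3x4 grid world and its transition model.
# State s_ij is represented as a tuple (i, j); the names below are
# illustrative assumptions, not notation from the video.

ROWS, COLS = 3, 4
WALL = {(2, 2)}                 # the robot can never occupy this square
TERMINALS = {(2, 4), (3, 4)}    # absorbing goal states
START = (1, 1)

# Movement offsets from *our* perspective: "up" decreases the row index.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

# For each intended action, the two "slip" directions are the 90-degree
# turns to the robot's left and right (each taken with probability 0.1).
SLIPS = {"up": ("left", "right"), "down": ("right", "left"),
         "left": ("down", "up"), "right": ("up", "down")}

def move(s, direction):
    """Apply a direction; bounce back to s if we hit a wall or the edge."""
    i, j = s
    di, dj = MOVES[direction]
    s_next = (i + di, j + dj)
    if s_next in WALL or not (1 <= s_next[0] <= ROWS and 1 <= s_next[1] <= COLS):
        return s          # bumped into a wall: stay in the same square
    return s_next

def transition_model(s, a):
    """Return P(s' | s, a) as a dict mapping s' to its probability."""
    if s in TERMINALS:
        return {s: 1.0}   # absorbing: the process has ended
    probs = {}
    slip_a, slip_b = SLIPS[a]
    for direction, p in [(a, 0.8), (slip_a, 0.1), (slip_b, 0.1)]:
        s_next = move(s, direction)
        probs[s_next] = probs.get(s_next, 0.0) + p
    return probs
```

For example, transition_model((1, 1), "down") returns {(2, 1): 0.8, (1, 2): 0.1, (1, 1): 0.1}: slipping to the robot's right would push it off the grid on our left, so that 10% of the probability stays on s11.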
Let me draw a little picture to illustrate how the transition model works. Suppose this is the current state, and suppose that the robot is trying to go down, to the state below the current state. With an 80% chance it is going to succeed; with a 10% chance, it ends up going to its left, which is our right; and with another 10% chance, it ends up going to its right, which is our left. Using this picture, you can see that the transition model is specified from the perspective of the robot: with an 80% chance it goes in the intended direction, with a 10% chance it goes to its left, and with another 10% chance it goes to its right.

Finally, let's talk about the reward function. By the reward function here, I don't mean the total discounted reward over time; we're not going there quite yet. We're only talking about the reward of entering each state, so R(s) denotes the reward of entering state s. Recall that we've simplified the reward function notation a little bit. In general, the reward could depend on the starting state, the action, and the resulting state, but here we've only included the resulting state. We're saying that it doesn't matter which state we started in and it doesn't matter what action we took; as long as we end up in this state, we get this reward. This is mostly a boring world, in that there are only two goal states. If we get to the state s24, we get a reward of minus 1; we want to avoid this state if possible. If we get to the state s34, we get a reward of plus 1; we like this state and would like to reach it if possible. Otherwise, if we enter any other state, we get a small negative reward, minus 0.04. This small negative reward models the cost of taking actions, the cost of expending energy to explore this world. The robot is roaming around in this world and exploring it. Ultimately, its goal is to get to one of the goal states, but it has to explore in order to eventually get there, and as it moves around, the actions deplete its energy. We're using this small negative number to represent that. This is a reasonable assumption for a practical scenario as well. In practice, exploration requires energy, so as we explore more and more, we have less and less energy, and this should be reflected in our utility function.
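The reward of entering a state can be sketched the same way. Again, the function name reward and the tuple states are illustrative assumptions carried over from the sketch above; the values minus 1, plus 1, and minus 0.04 are the ones just described.

```python
# A minimal sketch of the reward-of-entering-a-state function R(s),
# using the same (i, j) tuples as the transition sketch above.

def reward(s):
    """Reward for *entering* state s."""
    if s == (2, 4):
        return -1.0      # bad goal state: avoid if possible
    if s == (3, 4):
        return +1.0      # good goal state: try to reach this one
    return -0.04         # small per-step cost for every other state
```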
It can be tricky to understand the transition model, so let's look at a practice question to make sure that you really understand how it works. Here's the question. Suppose the robot is currently in state s14 (I've highlighted that state for you), and the robot is trying to move to our right; let me draw the intended direction for you. Given this intended direction, what is the probability that the robot stays in the same state, s14? Think about this yourself, and then keep watching for the answer.

The answer is short, so I'm going to include it in this video. The correct answer is D: with a 90% chance the robot stays in the same state. Here's why. The intended direction is to our right. If the robot goes in the intended direction and succeeds, that's an 80% chance; but on our right there is a wall, so the robot bumps into the wall and comes back to the same state. So with an 80% chance it comes back to the current state. With a 10% chance the robot will try to go up, but there is another wall there, so if it tries, it comes back to the same state with that 10% chance as well. With the remaining 10% chance, the robot tries to go down; if it does that and succeeds, it reaches one of the goal states and escapes from the world. All in all, with an 80% chance plus a 10% chance the robot bumps into a wall and comes back to the same state, so in total there is a 90% chance that the robot stays in the same state.

That's everything for this video. After watching this video, you should be able to explain the components of the grid world and especially explain the transition model. Thank you very much for watching. I will see you in the next video. Bye for now.