Okay, we've now seen the broad reinforcement learning setting, the sequential decision-making setting. In the next few slides we'll formalize this further: we'll be talking about the idea of the Markov decision process. But before we get into that, let's set up an example of a very simple Markov decision process, and from that we will generalize to what a Markov decision process looks like in general.

All right, so here is a very toy example of a Markov decision process. In this case we have a grid of cells. One of those cells happens to be occupied, and the other cells are free. The agent starts at this location, and it has to navigate this grid of cells in some way. At every time step it receives a small negative reward. It's trying to accomplish some task (we don't know yet what that task is), but for every time instant spent in the grid it loses some reward; it gets a penalty. A small negative reward corresponds to a small penalty. Its aim is therefore to complete the task, whatever it is, as quickly as possible, so that it doesn't accrue this negative reward over time.

There is also a big bonus if the agent ends up at this state. This state is what's called a terminal state: once you reach it, you don't have any more turns. You have completed your episode, and you can't do anything more in the environment. Similarly, if you end up at this other state, that is also a terminal state, and again you get a large reward, but this time a large negative reward. You get a large penalty for ending up at this state.

So this is basically telling us what we need to accomplish. These two together, which define the rewards in this environment, are telling us that what we need to do is get as quickly as possible to this square while trying to avoid falling into the other square.

The agent's action space is that it can move forward, to the right, to the left, or downwards: north, east, south, or west. If it tries to move, say, south from this square, then it will stay where it is. If it tries to move west from this square, it will stay where it is. If it tries to move into a solid cell, it will stay where it is. All of those forbidden actions result in a no-op, so you don't get to move your agent at all, which is of course a penalty in itself, because you've wasted some time and you get a small negative reward for wasting time.
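To make this concrete, here is a minimal sketch of such a grid world in Python. The lecture doesn't give exact numbers, so the 4x3 layout, the -0.04 step penalty, the +1/-1 terminal rewards, and the cell coordinates are all assumptions (borrowed from the classic textbook version of this example). Only the structure follows the description above: free cells, one solid cell, two terminal states, and no-ops for forbidden moves.

```python
# Minimal grid-world MDP sketch. The 4x3 layout, -0.04 step penalty, and
# +1/-1 terminal rewards are assumed values; the lecture only gives the structure.

BLOCKED = {(1, 1)}                         # the single occupied (solid) cell
TERMINALS = {(3, 2): +1.0, (3, 1): -1.0}   # big bonus / big penalty states
STEP_REWARD = -0.04                        # small negative reward per time step
WIDTH, HEIGHT = 4, 3

MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

def deterministic_step(state, action):
    """Move one cell in the given direction; forbidden moves are no-ops."""
    x, y = state
    dx, dy = MOVES[action]
    nxt = (x + dx, y + dy)
    # Moving off the grid or into the solid cell leaves you where you are.
    if not (0 <= nxt[0] < WIDTH and 0 <= nxt[1] < HEIGHT) or nxt in BLOCKED:
        return state
    return nxt

def reward(state, action, next_state):
    """Terminal states pay their bonus or penalty; every other step costs time."""
    if next_state in TERMINALS:
        return TERMINALS[next_state]
    return STEP_REWARD
```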
Even if you try to execute a valid action, for example moving forward from this state, that action is only executed correctly 80% of the time. The remaining 20% of the time is split between two options: if you were at this state and you tried to move forward, then 10% of the time you would end up in this square instead, and 10% of the time you would end up in this other square instead (this stochastic movement is sketched in code below). That's depicted visually here at the bottom. So this basically defines your entire Markov decision process for the simple grid-world environment.

The goal of reinforcement learning is to try to maximize the sum of rewards over time, until you terminate the episode at either of those two terminal squares. So this is a simple example of a Markov decision process; let's now try to generalize from it.

All right, so before getting into the description of the Markov decision process in general, let's quickly look at a visualization of what we saw in the case of the grid world as the state transition. If our grid world had been deterministic, then you could have drawn a state transition diagram in the following way. A state transition diagram maps from one particular state to another state, and the thing that does the mapping is an action. In particular, if you execute the north action from this state, where the agent occupies this square, then you will move one step forward and therefore end up in this state of the environment, and you can similarly draw what the states corresponding to the other actions would have been. In the stochastic version of the environment that we actually described on the previous slide, you wouldn't end up with a single state; instead, you would have a distribution over three states. In particular, you'd have a 10 percent probability of ending up in this state, an 80 percent probability of ending up in this state, and a 10 percent probability of ending up in this state.

This is useful to remember as we start describing Markov decision processes in general, because you could end up with a state transition diagram that looks quite complicated, like this one. This is an example of a general Markov decision process. Every green node here corresponds to a state, and the red nodes correspond to actions that you can execute from that state. These actions then have corresponding transition probabilities. For example, the way to read this is: at s0, if you execute the action a0, then you end up at state s2 with a probability of 0.5, and you end up back at state s0 with a probability of 0.5. The yellow arrows correspond to the rewards in this environment. In particular, if at s2 you execute the action a1, that action can result in transitioning to s1 with a probability of 0.3, transitioning to s2 with a probability of 0.4, and the remainder of the probability, that is 0.3, is going to take you back to s0. And if you got unlucky and transitioned back to s0, then you get a negative reward of minus one.
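The 80/10/10 movement rule described above is easy to write down, assuming the deterministic_step helper from the earlier grid-world sketch: the intended move succeeds 80% of the time, and the remaining 20% is split between the two perpendicular directions. This is a sketch, not a definitive implementation; the lecture only specifies the split itself.

```python
import random

# Perpendicular alternatives for each intended direction (10% each).
PERPENDICULAR = {"N": ("E", "W"), "S": ("E", "W"),
                 "E": ("N", "S"), "W": ("N", "S")}

def stochastic_step(state, action):
    """Execute the intended action 80% of the time, else slip sideways."""
    r = random.random()
    if r < 0.8:
        actual = action                    # intended move, 80%
    elif r < 0.9:
        actual = PERPENDICULAR[action][0]  # one perpendicular slip, 10%
    else:
        actual = PERPENDICULAR[action][1]  # the other perpendicular slip, 10%
    return deterministic_step(state, actual)
```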
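The s0/s1/s2 diagram itself can be written down as a transition table. Only the probabilities actually read off in the lecture are filled in below; the remaining (state, action) entries come from the figure, which the transcript alone doesn't fully specify, so they are left as placeholders.

```python
# Transition probabilities P(s' | s, a) for the three-state example,
# containing only the entries stated in the lecture.
P = {
    ("s0", "a0"): {"s2": 0.5, "s0": 0.5},
    ("s2", "a1"): {"s1": 0.3, "s2": 0.4, "s0": 0.3},
    # ... the remaining (state, action) pairs come from the figure
}

# Rewards R(s, a, s'); the lecture only mentions the -1 on s2 --a1--> s0.
R = {
    ("s2", "a1", "s0"): -1.0,
    # ... other rewards, if any, come from the figure
}

# Sanity check: each transition distribution should sum to 1.
for (s, a), dist in P.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9, (s, a)
```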
The reward is therefore a function of the state that you executed an action from, the action that you executed, and the state that you ended up at.

So now let's describe an MDP more formally. An MDP is typically described by the tuple (S, A, P, R). The first two of these might sound familiar, because we described them briefly in the introduction. S is the set of states, with individual states s belonging to S. In the grid world it would be all the different configurations of the environment, all the different cells that you can occupy; in this case the set of states is just the three states s0, s1, and s2. The set of actions is A. In the grid-world example it would be north, south, east, and west, and in this case you can see that there are just two actions, a0 and a1.

The third thing, which we haven't yet described formally but which is related to the state transition diagram we drew, is the transition function, often called the state transition function. All it says is the probability of transitioning into a new state s' given that you're currently at state s and executing the action a: p(s' | s, a), the probability that executing action a from state s leads to state s'. So, for example, the transition function for executing action a1 from state s2 and ending up at state s0 would be p(s0 | s2, a1) = 0.3; you can read that off the diagram. This is often also called the dynamics model, or just the model of the environment.

The fourth item in the tuple, which completes our description of the MDP, is the reward function. Like we said, those yellow arrows are functions of the state that you started at, the action that you executed, and the state that you end up at, so that's r(s, a, s'). Sometimes, in some MDPs, this can be abstracted to just r(s), because in those MDPs the rewards are always just functions of the state you end up at. Additionally, you're sometimes also given a distribution over start states or terminal states.

Now, when we apply reinforcement learning, we typically do not know the true state transition function or the true reward function; we only get samples from them. In other words, we don't know the state transition diagram in advance. Instead, we have to learn through trial and error, and in that process we will encounter samples from the transition function. You will execute an action a1 from state s0 and find yourself at s2, and at that point you can only say with certainty that sometimes, when you execute action a1 from s0, you end up at s2. You can't say with certainty, after having done this only once, that you will end up at s2 each time. But after you've repeated the process many times, you could have enough information to know that. So typically you don't have access to the underlying function; you instead only observe samples from it.
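This "samples, not functions" point is easy to see in code. Here is a small sketch that estimates the transition probabilities for one (state, action) pair by repeatedly sampling transitions and counting, using the P table from the earlier sketch as a stand-in for the unknown environment. In a real problem the samples would come from actually acting in the environment, not from a known table.

```python
import random
from collections import Counter

def sample_next_state(P, s, a):
    """Draw s' ~ P(. | s, a): this is all the agent ever gets to observe."""
    dist = P[(s, a)]
    return random.choices(list(dist), weights=list(dist.values()))[0]

def estimate_transitions(P, s, a, n=10_000):
    """Estimate P(s' | s, a) from n sampled transitions by counting."""
    counts = Counter(sample_next_state(P, s, a) for _ in range(n))
    return {s_next: c / n for s_next, c in counts.items()}

# With enough samples the estimates approach the true 0.3 / 0.4 / 0.3.
print(estimate_transitions(P, "s2", "a1"))
```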
That's what I mean by observing samples: if you execute an action a0 from s2, then you'll sometimes sample the state s0 as a result and sometimes sample the state s2 as a result. Similarly for the reward, you don't know with certainty what function it is. It could be a function of (s, a, s'), it could be just a function of the state, it could be a function of just the action that you executed, and you have to learn all of that; you only get samples from it.

So let's quickly cover why we call this a Markov decision process. It is named after the Russian mathematician Andrey Markov, and these are called Markov decision processes because they have what's called the Markov property: given the present state, the future states and the past states are probabilistically independent of each other. In other words, everything that you need to know about the past is already included in the present state. To predict what will happen in the future, you don't have to look into the past; you only have to look at the present, the description of the present in terms of the state. That state variable in our Markov decision process is key: it's only a Markov decision process if the state is a Markov state. That means predicting the state at time t+1 given the state and action at time t and all the previous states and actions reduces, because of the Markov property, to predicting it given just the present state and action; you can remove all the earlier variables:

p(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0) = p(s_{t+1} | s_t, a_t)

We said that you only need to know the present state in order to predict what happens in the future, so we don't need to know anything about the past at all. That's the reason our transition probability is written simply as p(s_{t+1} | s_t, a_t), or p(s' | s, a) as we will sometimes write it.

Okay, so now let's apply this abstraction to a few different examples. If a dog is the learning agent and you're trying to train it, then you can think of the actions as the muscle contractions of the dog, if you're trying to teach it a particular trick. Its observations are that it sees you and it smells the food that you're holding in your hand, for example. And the rewards are the food itself: if it actually gets to eat the food, then that's a reward.

Similarly, if you're trying to train a humanoid robot to perform, say, a backflip, then your actions could be the motor currents or torques in the various motors throughout the robot, your observations would be the camera images from the robot, and your rewards would be some measure of task success, like your ability to perform a backflip, or maybe you're just trying to train the robot to run fast.

And if you're doing inventory management, then you could say that your actions are what you should purchase at this instant in time, your observations are the current inventory levels, and the rewards are how well your business is doing overall.
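As a final illustration, here is a deliberately toy sketch of how that last inventory-management mapping might look as an environment with a step function: the action is a purchase quantity, the observation is the current stock level, and the reward is the profit for that step. Every specific number in it (prices, the demand model, the stock cap) is an invented placeholder rather than anything from the lecture; the point is just how the action/observation/reward abstraction lands in code.

```python
import random

class ToyInventoryEnv:
    """Toy inventory MDP: the current stock level is the (Markov) state."""

    MAX_STOCK = 20
    UNIT_COST, UNIT_PRICE = 1.0, 2.5   # placeholder economics

    def __init__(self):
        self.stock = 10                # placeholder starting inventory

    def step(self, purchase):
        """Action: units to purchase. Returns (observation, reward)."""
        cost = purchase * self.UNIT_COST
        self.stock = min(self.stock + purchase, self.MAX_STOCK)
        demand = random.randint(0, 8)            # placeholder demand model
        sold = min(demand, self.stock)
        self.stock -= sold
        reward = sold * self.UNIT_PRICE - cost   # how well the business did
        return self.stock, reward                # observation = inventory level

env = ToyInventoryEnv()
obs, r = env.step(5)   # buy 5 units, observe new stock level and profit
```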