This morning we have a lecture on hierarchical reinforcement learning, and I'm very happy to introduce Professor Anders Johnson. He's a full professor at this university, Universitat Pompeu Fabra, in the ICT department. He's also leading the research group in AI and machine learning, which is the group behind this event, this year's edition of the reinforcement learning summer school. In his research he has combined reinforcement learning and planning ideas in many successful ways, but he has also developed algorithms to improve the efficiency of learning, especially by exploiting the internal structure of decision processes, which can be of many kinds. One family of such structures is the one considered in hierarchical reinforcement learning, which is today's topic. Please thank Anders together.

Okay, thank you very much for the introduction, and good morning everyone. At the time I prepared my slides, the program had an interesting feature, a bright orange color coding for my lecture. The last time I checked it was not there anymore, but the explanation I got was that this was considered an advanced topic, and I cannot be here next week, which is when most of the advanced topics are happening; so that's why we're here today.

Just to say before I start: hierarchical reinforcement learning is a very big topic to cover in 90 minutes, so necessarily I have to make choices about which works I find the most interesting or the most important; there are many, many more works in this field than the ones I will mention today.

This is an overview of my lecture. I will start by giving an introduction to the idea of hierarchical reinforcement learning, and then talk about the existing formalisms that people tend to use. Then I will cover what I would call the most solid theory of hierarchical reinforcement learning, the theory of semi-Markov decision processes. After that I will talk a bit about theoretical properties of hierarchical reinforcement learning, or the lack of such properties. Finally, I will cover two somewhat more advanced topics. The first, subtask discovery, concerns the case in which nobody gives you the subtask structure: often, when you apply hierarchical reinforcement learning, someone hands the agent a subtask structure it can exploit, but in subtask discovery the agent has to find out on its own what would be good subtasks to have. The second is transfer learning: because in hierarchical reinforcement learning we solve so many subtasks, there is a lot of opportunity to transfer knowledge between them; if I have solved one subtask and I'm faced with a similar one, I can take advantage of the knowledge I already gained.

Okay, so to start with some motivation: some of the main challenges in reinforcement learning are the ones I've listed here, and several of them are precisely the challenges that hierarchical reinforcement learning can help address, in different ways, as we'll see during the lecture. One important challenge is sample efficiency: we'd like the learning agent to make the most of its data. It interacts with the environment, gains experience, and we want learning to be as fast as possible based on this experience. A related challenge is to scale up to complex decision processes with high-dimensional state and action spaces. A third one is the idea of abstraction.
So often the agent receives a lot of information about the environment, but much of this information might not be relevant for the task the agent has to solve. Abstraction tries to focus on the relevant parts of the state, and potentially of the actions as well, to simplify the problem representation: if we make the state space smaller, the learning problem naturally becomes simpler. And generalization: this is related to the transfer I mentioned, but not only transfer between problems; we can also have generalization between states. If I've been in a state before and learned what action to take there, and I then face a similar state, maybe the same action will be good in that state also, or maybe not, sometimes not, but we try to generalize knowledge across situations.

Okay, so hierarchical reinforcement learning can really be viewed as an instance of divide and conquer, which, as most of you probably know, is one of the oldest algorithmic ideas. I assume most or all of you know about things like the merge sort algorithm. We're given a collection of items that we want to sort, and the strategy is to divide the collection into two parts; now we have two problems which are also sorting problems, so the subproblems are of the same type as the original problem. Then we recursively sort the two parts, and finally we have a merge operation for combining the two sorted collections into a single sorted collection. So divide and conquer works this way: you recursively divide your problem into two or more smaller subproblems of the same type, you solve those subproblems, and then you have some way of combining their solutions into a solution to the original problem. As we will see, in certain situations we can view hierarchical reinforcement learning as an instance of divide and conquer.
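To make the divide-and-conquer pattern concrete before we move to decision processes, here is a minimal merge sort sketch in Python (my own illustration, not from the slides):

```python
def merge_sort(items):
    """Divide and conquer: split, recursively sort the halves, then merge."""
    if len(items) <= 1:  # base case: a collection of 0 or 1 items is sorted
        return items
    mid = len(items) // 2
    left = merge_sort(items[:mid])    # two subproblems of the same type
    right = merge_sort(items[mid:])   # as the original problem
    # merge: combine the two sorted halves into one sorted collection
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]
```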
So let's look at a simple example of a sequential decision process. It's a grid world where an agent has to navigate from an initial location in the top left room to the goal location marked G, so it has to learn a policy for moving from the initial location to the goal location. The actions could be deterministic, or they might fail with some small probability; it doesn't matter for this example. Typically the agent here selects between primitive actions: go north, west, south, or east, or if you prefer, up, down, left, right. One challenge is that when people model this problem, typically you only assign reward when the goal state is reached. In this case the goal state is not that far from the initial state, but I could make the problem much bigger and put the goal much further away. So a challenge here is the long horizon of the problem: I have to take many, many of these primitive actions to reach the goal for the first time, so it becomes a hard exploration problem. There are exploration techniques for helping in such situations; I think there's a lecture on exploration on Friday.

Okay, but what would be the approach of hierarchical reinforcement learning to this problem? The approach would be to decompose the problem into a set of subtasks. In this case we can define the subtasks as going to the doorways between the rooms: from the top left room I go to the bottom doorway, from there I go to the next doorway to my left, etc. Of course, each of these smaller problems is also a sequential decision process, so I have to model them in some appropriate way. But if I look at the problem now, you see there are several benefits to this decomposition. Before, I needed a long sequence of actions to get to the goal; now I only have to chain together four subtasks, so I've effectively shortened the horizon of the problem. And each subtask can be thought of as partial progress towards the goal, even though we could also have subtasks that take me further away from it. At the top level, the policy still has to select correctly which sequence of subtasks to execute in order to achieve the goal, so solving the overall task can be thought of as solving a sequence of subtasks.

Okay, so how do we represent these subtasks? A lot of the next two parts is about how to represent such subtasks and how to combine their solutions to obtain a solution to the overall task. But briefly, each subtask is also a sequential decision process. For example, when I start in the top left room and want to reach the bottom doorway, I can think of that as a smaller decision process, the one on the right, where I'm now acting only in this smaller space, this one room, and my goal is not the faraway goal but rather the bottom doorway.

Okay, as another motivation I want to talk about how humans solve decision processes. Imagine that we have to travel to the airport; how do humans plan such a trip? Many of you might have read the reinforcement learning book, and I think they discuss similar examples. As a human, my basic actions are moving my arm, moving my leg, and so on. I could model the problem of going to the airport in terms of these basic movement actions, but that would give me an extremely long horizon for reaching the airport, and it's not how we think about solving this problem. Rather, we naturally divide the problem into subtasks. If I have to go to the airport and I'll take the train, then first I have to go to the train station, then I have to buy a train ticket, then I have to get on the correct train. I plan at this high level of abstraction, and each of these high-level tasks becomes a subtask that I then have to solve: once I decide to go to the train station, my problem becomes how to get to the train station. Of course I'm not a neuroscientist, and it's very difficult to reason about exactly how humans make decisions, but I feel fairly certain that humans tend to apply this form of reasoning: we break tasks into smaller pieces that are easier to solve, and then we chain such subtasks together to solve a given problem.

Okay, so before I finish the introduction, I will mention some of the benefits people have identified for hierarchical reinforcement learning. One I already talked about is reducing the effective horizon of the problem: even though I might need many primitive actions to reach my goal, or the region with high reward, by chaining together subtasks I might need far fewer subtasks to reach the same goal.
I might also explore more efficiently. If I have these subtasks that lead me to doorways, imagine that I start in the top left room, I only take primitive actions, and I just don't know what to do: I'm not getting any reward for any of my actions there, so I just do a random walk. A random walk will tend to stay in the top left room with high probability, but if I instead randomly choose between options that go to doorways, I will end up far from my original state much more quickly than with primitive actions. So subtasks also help me move away quickly in particular directions, which of course depend on how I've defined the subtasks to begin with.

Another benefit, which I'll talk about later, is the idea of transferring learning between subtasks: if I have two subtasks that are very similar, I can reuse knowledge about one to solve the other. And another idea, which is what the last point says, is to improve sample efficiency through structured credit assignment. What I mean by this is that in the original task the agent only gets reward when it reaches the goal, but with subtasks, the agent gets rewarded for reaching the intermediate points: it internally gets rewarded for going to doorways on the way to the goal. This helps direct the agent: even though it doesn't yet know which direction to go, it has intermediate points where it gets credit for what it's doing. And one more possible benefit: if I'm given the information beforehand that the goal state is in one of these doorways, or hallways between the rooms, then exploring only by going to doorways also reduces the effective space in which I have to look for the goal, which makes exploration much, much more efficient in this case.

Okay, so now I'll go into the part about existing formalisms. Hierarchical reinforcement learning is almost as old as reinforcement learning itself, I would say. Most of the theory of hierarchical reinforcement learning was developed in the 90s, by the people I will mention now. And again, many, many other researchers have proposed different forms of hierarchical reinforcement learning; I'm only going to mention what I consider to be the four most important frameworks, the ones that have been reused the most since their development.

So the first one was called feudal reinforcement learning. In this framework we don't think of one agent that solves many subtasks; rather, the idea is that we have many agents that each solve some subproblem and interact with the other agents in some way. The example is a navigation task with obstacles. At the top level, level zero, we have the overall manager, which is responsible for solving the whole task. At level one, that manager has four sub-managers, each in charge of a portion of the state space, and in turn each of these has four sub-managers of its own on the next level, and so on. At the lowest level we call the managers workers, because they have nobody working below them to give subtasks to.
And the workers are the ones that actually execute actions in the environment. So how does this work? Managers are responsible for telling their sub-managers what the subtasks should be, and they have the power to punish and reward the sub-managers on the next level. This makes it seem like they have absolute power over their sub-managers, but they cannot do whatever they want, because they in turn get punished or rewarded by their super-manager, who has told them what to do; so they have to pay attention to the subtasks they have been given. There are two important principles that the authors identified. One is called reward hiding: when a sub-manager carries out its task successfully, it should always be rewarded for that, even if its manager was punished for doing the wrong thing. The sub-manager did the right thing, it completed the task it was given, so its reward should be independent of whatever its manager was told to do; and in the same sense, if it doesn't achieve its goal, it should be punished. The other principle is information hiding, which is related to the idea of abstraction: we should give each manager only enough information to solve its task; we don't need to give it more than that.

Okay, so the second important paper on hierarchical reinforcement learning was the framework called hierarchies of abstract machines, by Ron Parr and Stuart Russell. In this case, and actually in all three of the frameworks I'll talk about now, we assume that we're given an MDP to start with; that's the MDP we want to solve. But here the agent is in addition given these finite state controllers that you see on the right. A finite state controller tells the agent which action sequences, or policies, it may follow. Each of the round nodes in the controller represents a behavior: I have a behavior follow wall, and follow wall is itself another finite state controller that selects between other behaviors. Then we have these choice points, the square nodes in the automaton, which are the only places where the agent has to make a choice; here it has to choose whether, in this state, it wants to follow a wall or to back off. They use an environment with obstacles, and the obstacles are intentionally made concave, so that if the agent gets inside an obstacle, it really has to go in the opposite direction to get out again; that's why there is a back-off behavior, since otherwise the agent could just follow the wall until it passed the obstacle. The only place we need to learn a policy is at these choice states; the agent has to learn what to do there. And the key benefit you get from these finite state controllers is that they limit the possible behaviors of the agent in a particular situation: the agent cannot choose freely between going north, west, east, or south; it can only choose between these two high-level behaviors, and that simplifies learning a policy.
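To give a feel for the choice-point idea, here is a drastically simplified sketch (my own, omitting the full HAM machinery of machine states and call stacks; the behavior stubs are invented):

```python
from typing import Callable, Dict

Action = str
Observation = dict  # hypothetical environment observation

def follow_wall(obs: Observation) -> Action:
    # fixed behavior stub: keep moving along the wall
    return "forward" if not obs.get("wall_ahead") else "turn_left"

def back_off(obs: Observation) -> Action:
    # fixed behavior stub: reverse away from a concave obstacle
    return "backward"

BEHAVIORS: Dict[str, Callable[[Observation], Action]] = {
    "follow_wall": follow_wall,
    "back_off": back_off,
}

def choice_point_step(obs: Observation,
                      learned_choice: Callable[[Observation], str]) -> Action:
    """Only at the choice point does a learned policy decide anything, and
    only among the behaviors offered; everywhere else the machine dictates
    the action, which is what simplifies learning."""
    return BEHAVIORS[learned_choice(obs)](obs)
```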
I've actually reversed the temporal order of the last two frameworks, because I'm going to talk more about the last one. This framework is called MAXQ decomposition, and it was proposed by Tom Dietterich in 2000. Here again we're given an MDP, and given this MDP we specify a set of tasks, one of which acts as the root, M0, at the top. Each task is a tuple consisting of: a set of terminal states where the task terminates; a set of actions the task can take, which can be either primitive actions of the MDP or other subtasks; and a pseudo-reward function, which I'll get to in a moment. Which subtasks each task can choose from is illustrated in this graph: the root task can choose between the two tasks get and put; get can choose between the primitive action pickup and a navigate task; and the navigate task chooses between the primitive actions north, south, east, and west. The third component, the pseudo-reward function, exists because, when you're learning to perform a task, you need to be rewarded in some way for completing it, even if in the high-level MDP there's no reward for completing that particular task. So each task has its own reward function that governs what it, or rather its policy, should do.

Okay, and then the last framework is called the options framework, and I would say it's not even arguable that this has become the most popular framework for hierarchical reinforcement learning in the literature. It was introduced by Rich Sutton, Doina Precup, and Satinder Singh, I believe in 1999. Here, again, we start with an MDP, and an option is also a tuple with several components. The first component is the set of states where we can apply the option, called the initiation set: I_O is a subset of the states where we can apply option O. π_O is the option policy, the policy that chooses between actions. This policy could be either stochastic or deterministic; here I've written it as a stochastic policy, but in most of my other slides I assume it's deterministic, so it selects one action in each state. Each option also has a termination function β_O, a mapping from each state to a value between 0 and 1, which is the probability of terminating in that state. The value could be 0, meaning I never terminate in this state; it could be 1, meaning I always terminate there; or it could be something intermediate. Say it's 0.5: then I draw a random value uniformly between 0 and 1, and if it's less than the termination probability, I terminate in that state. So this is what it says here: the option can be chosen in any state in its initiation set, it repeatedly selects actions according to its policy, and it terminates in a given state s′ with probability β_O(s′). An option might thus run for multiple time steps, until it reaches a state where it terminates according to this random draw. As I said, in the next couple of sections I will talk extensively about the options framework.
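The option tuple translates directly into a data structure. Below is a minimal sketch (the names `Option` and `make_primitive_option` are mine, not from the original paper); the helper at the bottom anticipates a point made later in the lecture, that a primitive action is a one-step option:

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Set

State = Hashable
Action = Hashable

@dataclass(eq=False)  # identity-based hashing, so options can be dictionary keys
class Option:
    initiation_set: Set[State]             # I_O: states where the option may be chosen
    policy: Callable[[State], Action]      # pi_O: deterministic here, one action per state
    termination: Callable[[State], float]  # beta_O(s): probability of terminating in s

def make_primitive_option(action: Action, applicable_states: Set[State]) -> Option:
    """A primitive action as a one-step option: applicable wherever the action
    is, always selecting that action, always terminating after one step."""
    return Option(
        initiation_set=applicable_states,
        policy=lambda s: action,
        termination=lambda s: 1.0,  # beta_O(s) = 1 in every state
    )
```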
Before I finish this part: as I mentioned, there have been many attempts to formalize the idea of hierarchical decomposition in reinforcement learning, and I would say one problem is that there has been no strong common terminology, even among researchers. As an example, a huge number of names have been used for the idea of a subproblem. We've already talked about task or subtask, and manager and worker in feudal reinforcement learning; sometimes I've seen the terminology master and slave for manager and sub-manager. Option, which we just saw; and some people say temporally extended action, macro action, activity, skill, behavior, mode. All of these essentially mean the same thing. They are not always defined in exactly the same way, but the idea is that all of them represent subtasks that the agent can perform as part of achieving its overall task.

Okay, so as I said, the most solid theory in hierarchical reinforcement learning is, I would say, the one on semi-Markov decision processes. In this part, I'm going to explain this theory and how it relates in particular to the options framework. Semi-Markov decision processes are actually older than hierarchical reinforcement learning itself: they were already proposed in the 60s, not long after Markov decision processes. The only difference with respect to an MDP is that we assume the actions have variable duration, and this duration is a random variable that is drawn every time you apply an action.

Formally, the notation I'm going to use in the rest of the slides, hopefully consistently, is that an SMDP gets this hat notation. An SMDP will also be labeled M, because it's really very similar to an MDP, and the states, actions, reward function, and transition function are going to be S, A, R, and P with hats on them. The definition is exactly the same as for MDPs, a set of states, a set of actions, a reward function; the only difference is the transition function. The transition function now gives, for each state-action pair, a probability distribution not only over the possible next states but also over the possible duration of the action when applied in a given state. So this notation, the probability of reaching state s′ in n time steps when applying action a in state s, is the probability of transitioning to s′ in exactly that many time steps.

Once we have this notation, we can set up a Bellman optimality equation, which I assume you've seen plenty of this week already. We can define a state value in the same way as for an MDP. The optimal value of a state s can be written as the maximum over the actions I can take in that state of the expected reward for that action in that state, plus a summation over possible next states s′. And now I also have to sum over the possible durations of the action: I sum over n, and put in the probability of transitioning to s′ in n time steps given s and a. And if I'm discounting, I have to properly discount the value of the next state s′: I discount n times the optimal value of s′ if the action lasts for n time steps. I can make this look even more like the Bellman optimality equation for an MDP by introducing a kind of pseudo transition function P̃, which looks very much like the transition function of an MDP. Well, it's not an actual probability, as I'll explain in a second, but it can be thought of as the probability of reaching state s′ given s and a. It corresponds to the part in red above: the summation over n of the probability of reaching s′ in n time steps, times the discount factor raised to the power n. This pseudo transition function is not an actual probability distribution, because of the discounting, but you can still define things this way, and you obtain something that looks very, very much like the Bellman optimality equation for an MDP.
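Written out in the hat notation just introduced (this is my transcription of the spoken description), the SMDP Bellman optimality equation and the pseudo transition function are:

```latex
V^*(s) = \max_{a \in \hat{A}} \Big[ \hat{R}(s,a)
       + \sum_{s'} \sum_{n} \gamma^n \, \hat{P}(s', n \mid s, a) \, V^*(s') \Big]

\tilde{P}(s' \mid s, a) = \sum_{n} \gamma^n \, \hat{P}(s', n \mid s, a)
\;\;\Rightarrow\;\;
V^*(s) = \max_{a \in \hat{A}} \Big[ \hat{R}(s,a)
       + \sum_{s'} \tilde{P}(s' \mid s, a) \, V^*(s') \Big]
```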
And as you might imagine, if we have something that looks very much like the Bellman equation for an MDP, it's not difficult to adapt all the standard reinforcement learning algorithms to SMDPs: we can do value iteration, policy iteration, Q-learning, et cetera. I'm going to show an example of Q-learning in a bit. So you can almost think of this extension to variable durations as coming for free, in a sense; you probably pay a small price in having to consider very long possible action durations, but the reinforcement learning algorithms do not become much more complicated when you work with SMDPs.

Okay, so the key idea here is that if I have an MDP and I give you a set of options, as defined before, this defines a semi-Markov decision process. In particular, given an MDP M and a set of options O, the SMDP I obtain has some state space Ŝ; its action set is in fact the set of options O; and R̂ and P̂, the SMDP reward and transition functions, are defined as below. The state space Ŝ is a set of states where the options are applicable, which might of course be the same as the overall state space S, but it could also be smaller. For example, I could define a set of options that only terminate in certain regions of the state space; then I never have to consider action choices in the other states, where options never terminate, so I can make the state space effectively smaller. We're going to look at some ways of doing that later.

The reward associated with applying an option O in some state s_0 is the expected sum of discounted rewards obtained while following the policy of the option, π_O: when I'm in some state S_i, I take the action prescribed by the option policy. This is conditional on starting in state s_0, and the S_i are random variables denoting the state at time i, so the expectation is over these states. T is also a random variable, denoting the duration of the option, which may vary from execution to execution.

And the probability of transitioning to a state s′ in n time steps, starting from s_0 and applying option O, is the sum over all possible sequences of intermediate states s_1 to s_{n−1} of the probability of following exactly that sequence of states: for each i from 1 to n−1, the probability according to the MDP of transitioning to s_i from s_{i−1} when following the option policy, times the probability of not terminating in s_i (because we want the option to terminate after exactly n steps); and then all of this times the probability of finally transitioning to s′ from s_{n−1}, again following the option policy, and actually terminating in s′. That is what it takes for the execution to have exactly duration n.
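In symbols (again my transcription from the spoken description), the induced SMDP reward and transition functions are, with T and the S_i random variables under the option policy π_O:

```latex
\hat{R}(s_0, o) = \mathbb{E}\Big[ \sum_{t=0}^{T-1} \gamma^t R\big(S_t, \pi_o(S_t)\big) \;\Big|\; S_0 = s_0 \Big]

\hat{P}(s', n \mid s_0, o) = \sum_{s_1, \ldots, s_{n-1}}
  \Big[ \prod_{i=1}^{n-1} P\big(s_i \mid s_{i-1}, \pi_o(s_{i-1})\big)\,\big(1 - \beta_o(s_i)\big) \Big]
  \, P\big(s' \mid s_{n-1}, \pi_o(s_{n-1})\big) \, \beta_o(s')
```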
[Audience question.] Good question. I actually had it written like that first. You could define Ŝ as the union of the initiation sets of all the options. But, as I said, you could also restrict the termination conditions of the options so that they only terminate in some subset of states; then, even if the options are applicable in other states, you will never actually be in those states to apply an option, and you could adjust the initiation sets to account for that. I think it would be fine to define it in the way that you suggest.

[Audience:] Is the SMDP given, or constructed? Good question. The set of options is given; actually, I'm going to come to this in just a couple of slides. The origin of this question is right here: we assume that the option set is given, and that includes the option policies, so someone has to tell us exactly what policy each option follows. I'm going to partially address that as well in later slides.

Okay. So this is the important property that the original authors of options noticed: I can enhance an MDP with a set of options, and that induces an SMDP. I'm going to illustrate this on the next slide, but let me also say that a primitive action can be thought of as a special case of an option, so you can include primitive actions in your option set as well. A primitive action can be seen as an option which is applicable everywhere, or, since some actions might not be applicable in all states, whose initiation set is the set of states in which the action is applicable; whose policy always selects that action; and which always terminates after one time step, so the β function equals one in every state. I take the action once, and in the next state I always terminate. This means I can include primitive actions in an option set as special cases of options.

Okay, so here is what's shown here. In an MDP, the agent takes a decision at each time step, so the decisions are regular in the sense that they all happen at the same time interval. In an SMDP, because each action might take a variable amount of time, after my first action choice I might have to wait for some time, then I make an action choice again, then I wait a different amount of time, et cetera; I don't necessarily make decisions at each individual time step. And in the case of options over an MDP, I have a decision process on two levels. At the top level I have the induced SMDP, which chooses among the options to apply; the SMDP policy corresponds to the big circles. It chooses an option, waits for that option to terminate, which might take a variable amount of time, and then makes its next choice. But inside each option we still make an action choice at each time step using the option policy. So the agent acts at two time scales: the high level, the SMDP level, and the low level, the option level.

This relates to the earlier question, but first an example. I wanted to put in something that you've seen before this week, which is Q-learning; this slide shows how to apply Q-learning when working with options. Assume that we apply an option O in a state s_t, that we keep repeatedly selecting actions using the option policy π_O, checking termination with the termination function β_O, and that the option terminates in state s_{t+n} after n time steps. How do we apply Q-learning in this case? The first thing we have to do is maintain a sum of discounted rewards: while the option is executing at the low level, we register how much reward we receive at each time step, and we sum up the rewards we got while executing the option. This uppercase R_t is the sum, from k = 0 to n − 1, of the reward at time t + k, each discounted by γ^k. Now I can state the update rule of Q-learning with options as follows: the new value of the state-option pair (s_t, O) is (1 − α_t) times the old value plus α_t times the target, where the target is the sum of discounted rewards obtained during the option's execution, plus the maximum Q-value in state s_{t+n} over all the options I can select there, discounted by γ^n, because the option lasted for n time steps, so I have to discount appropriately, n times.
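As a rough sketch of this update in code, reusing the hypothetical Option class from earlier (the `env.step(action) -> (next_state, reward)` interface is assumed, as is that at least one option is applicable in every state):

```python
import random

def smdp_q_learning_step(Q, env, state, options, alpha, gamma, epsilon=0.1):
    """Run one option to termination, then apply the SMDP Q-learning update.
    Q maps (state, option) pairs to values, e.g. a defaultdict(float)."""
    applicable = [o for o in options if state in o.initiation_set]
    if random.random() < epsilon:                     # epsilon-greedy over options
        option = random.choice(applicable)
    else:
        option = max(applicable, key=lambda o: Q[(state, o)])

    # Execute the option, accumulating R_t = sum_k gamma^k * r_{t+k}.
    s, R, n = state, 0.0, 0
    while True:
        s_next, r = env.step(option.policy(s))
        R += (gamma ** n) * r
        n += 1
        s = s_next
        if random.random() < option.termination(s):   # terminate w.p. beta_O(s)
            break

    # Target: accumulated reward plus gamma^n times the best option value at s_{t+n}.
    next_applicable = [o for o in options if s in o.initiation_set]
    target = R + (gamma ** n) * max(Q[(s, o)] for o in next_applicable)
    Q[(state, option)] = (1 - alpha) * Q[(state, option)] + alpha * target
    return s  # the state where the next high-level decision is taken
```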
Okay. So all the theory so far, and this was the question asked before, assumes that the options are given, including the option policies: someone specifies these for all of the subtasks represented by the options. But of course, more often than not, when you apply reinforcement learning the agent doesn't initially know which actions to select. So what if option policies are not provided as prior information? Well, as I said before, a subtask is also a sequential decision process, so what we can simply do is define a local MDP for each option, and the option policy is then implicitly defined as the solution to this option MDP: I learn an option policy from experience by solving the option MDP.

For the option MDP I'm going to use the same notation as for an ordinary MDP, but with subscript O to indicate that this is the local MDP of an option. Again, the state space of the option might be smaller than the full state space, as we will see. The action space might also be smaller than the overall action space: I might only allow the option to select among a subset of the actions of the original MDP. And the transition function is basically the projection of the MDP transition function onto this potentially smaller state-action space of the option: the option transition probability P_O(s, a) equals P(s, a) of the MDP.

But then a question is how to define the option reward function. As we discussed, we should define a reward function that makes the option solve its associated subtask; I'm going to talk about some different choices for how to define such a reward function in the next few slides. One idea, in particular when we are interested in reaching a state or a subset of states, is to define a reward function that assigns reward when we reach the correct termination state of the option. In the example from before, if I'm in the top left room and my subtask is to navigate to the lower doorway, I put a goal state there and give the agent a reward when it reaches this goal state, even though in the original MDP there is no reward associated with reaching it. In this case the option reward can be independent of the overall task. This is related to the principle of reward hiding that I mentioned for feudal reinforcement learning: the option should be rewarded for solving its own subtask, rather than just receiving the same reward as the overall MDP. But this introduces some problems of its own, which I'll talk about next.
Okay, first let me say that with the theory I've now introduced, we have really achieved something that resembles divide and conquer. I take an MDP and divide it into subtasks, in the form of these option MDPs; each option MDP is its own decision process, so it's of the same type as the overall problem. Then I solve the option MDPs, and I combine their solutions by computing an SMDP policy that chooses among the subtasks at the high level. Of course, just as in merge sort, the problem is not solved once I've solved the smaller problems: I still have to solve the overall task by computing this SMDP policy.

Another issue is that I may be learning at two levels at the same time: at the top level I'm learning an SMDP policy, and at the bottom level I'm learning option policies. The induced SMDP I showed you before assumes the option policies are fixed, but if I'm changing an option policy over time, that introduces non-stationarity at the SMDP level: changing the policy of a single option means that the SMDP reward function and transition function change over time. Non-stationarity makes the learning problem harder, essentially; so in principle, learning option policies and the SMDP policy at the same time can be unstable. Many people do this in practice, and more often than not it converges, even though, as I will tell you later, I don't know of a single work that proves convergence when you learn policies at the two levels at the same time. That's a very interesting open question.

I'm going to come back again to this idea of local rewards, these option-specific reward functions. The issue is that the hierarchical structure might prevent you from finding the optimal policy of the original MDP, and Dietterich, in his work on MAXQ decomposition, already defined two notions of optimality for hierarchical reinforcement learning, which I'll explain using these two figures. The first notion is called recursive optimality: each option policy is learned optimally given the definition of its option MDP, so each option policy is locally optimal, and the SMDP policy is optimal on top of these option policies. The example on the left shows that this does not always correspond to a globally optimal policy. Assume there is an option in the left room for exiting the room through one of the two doorways, and that we reward this option equally for going through either door; then the policy we obtain is the one shown here. The black arrows mark states where the option takes actions that are locally optimal, because the closest doorway is the one at the bottom, but that are not optimal with respect to the overall goal, which is to reach the goal in the top right corner. So with recursive optimality, each option is locally optimal, but we might not have a globally optimal policy. Hierarchical optimality means instead that, given the hierarchical structure we've been given, the set of options, the policy is optimal with respect to this hierarchical structure. What do I mean by that? In the left example, hierarchical optimality would be achieved simply by changing the direction of these black arrows to go up.
The option could act this way; there's nothing that prevents the option from going up instead of down. So to obtain a hierarchically optimal policy, the option policy would need to go up in those six states. But on the right I put another example. Let's assume the agent is in the middle square, and in these middle squares the only options available to the agent are to go to one of the four corners, one, two, three, or four. Let's say that from corner four it can take another option that goes to the goal state, but this option is not available in the middle state A. So the hierarchical structure I've imposed by defining these options prevents the agent from going directly to the goal state; the best I can do, given the hierarchical structure, is to go to corner four and then apply the option for going to G, and clearly this is not an optimal policy for the overall MDP.

So, for defining what the option reward should be, there are two extremes you can consider. One extreme is the one we've seen, where we reward the option equally for reaching any terminal state: we would give the same reward for reaching the upper or the lower doorway, producing the policy we saw on the previous slide. I call this uncoupled, because this local reward for the option is unrelated to the high-level task we have to solve, which is to get to the top right corner; and this is the reason why the recursively optimal policy is not globally optimal. The other extreme you might consider is to make the options fully coupled, in the sense that the reward I give to the option in the left room for reaching one of the doorways is the value of the option in the right room for reaching the goal. This seems great, because then I will typically get a higher value at the top door, since I'm closer to the goal, and the option in the left room will likely decide to go up instead of down in those problem states. But what's the problem with doing this? If you do this, then you're not really taking advantage of hierarchical decomposition anymore, because you're learning a value function over all states again: the option in the left room will be bootstrapped based on the option in the right room, so I'm not really simplifying learning. A very interesting research question, which I would say nobody has fully solved, is to find some intermediate between these two extremes: if our goal is to reach the top right corner, we would like the option in the left room to prefer the top doorway in some way, but without the full coupling that forces us to solve the full problem again. I think this is a very interesting open research question.

Okay, so before finishing this part: we've now seen that if I want to apply hierarchical reinforcement learning to my problem, there are a number of design choices to make. I have to decide which options to include and which subtasks these options should solve, by defining the termination functions appropriately. I need to figure out whether I have the option policies available as prior knowledge or not; typically I don't, so I have to define these option MDPs in some way so that I can learn the option
policies from experience. I also have to decide whether to include primitive actions in the option set; I'll come back to this in the next part. And to really take advantage of hierarchical decomposition, I would like the SMDP to be simpler to solve than the original task; that's the whole goal of divide and conquer, to make the problem simpler. So one thing I will talk about in the next section is state abstraction: how can I simplify the problem at the high level?

[Audience:] Hi, thanks. I'm curious: I don't see termination probabilities, the β_O's, on this slide. Is that not a design choice as well? Something like telling me that an option has terminated once I've hit one of the doorways in that original picture?

Right, so which subtasks the options should solve is determined by how I define the termination function; it's implicit in that, yes. But you're right: which subtask to solve is perhaps the high-level question, and then I have to actually go in and say what the termination function should be in each state.

[Audience:] When you said on an earlier slide, check using β_O whether the termination condition has been met, is that usually defined in terms of some external, measurable event in the underlying MDP?

Right, I explained this before: if the termination function is zero, I never terminate; if it's one, I always terminate; and if it's intermediate, I have to roll a die, I have to sample a random number. If the probability of termination in a state is 0.5, I have to sample. But typically, yes, it's this simple mechanism.

Okay, so I've actually gone quite a bit over time on that part, so I'll try to go faster through the rest. I think someone asked how we measure the complexity of learning. Well, the sample complexity of learning depends on several parameters, but first and foremost it depends on the size of your state-action space. It may also depend on things like the mixing time or the diameter and so on, which I'm not going to talk about; let's focus on the size of the state-action space. To really simplify learning when we do hierarchical decomposition, we would like both the SMDP and each option MDP to be smaller than the original MDP; then we really achieve simplification, because we obtain smaller decision processes that are simpler to solve than the original task.

But these two desired properties are opposed, in the following sense. On one extreme, I could define only a single option, and then that option has to do all the work: its policy has to correctly maximize reward. With a single option the SMDP learning problem is trivial, always select the single option, but the option MDP is going to be roughly as complex as the original MDP, because that option has to learn about all of the MDP's dynamics. On the other extreme, I could have options that each act in only a single state, so they are effectively like primitive actions. Then the option MDPs are trivial to solve, they are just like multi-armed bandits, but the SMDP decision process is now as hard as the original MDP, because the SMDP has to decide in each state which action to take. So ideally we would like to find a trade-off that gives you a reasonably sized SMDP and also reasonably sized option MDPs.
And as far as I know, nobody has ever quantified what this trade-off should be to maximize the benefits of hierarchical reinforcement learning.

Okay, what about including primitive actions? The good thing about including primitive actions is that I will always be able to obtain the optimal policy of the original MDP, because I can always select among the primitive actions and ignore the options. But then the size of the SMDP is at least the size of the MDP, so I gain no benefit in this sense; there might be other benefits, but not in terms of the size of the state-action space.

Okay, one topic I wanted to at least talk quickly about is state abstraction. The idea of state abstraction is that you take your state space and map it onto a smaller space, which I call Z here; this is the principle of information hiding, giving each manager or each subtask only the information it needs to solve its own subtask. Then you compute a policy for the smaller abstract space, and ideally you should be able to translate this policy back to the original space. And because in hierarchical reinforcement learning there are so many tasks and subtasks, there are many opportunities for state abstraction: ideally you can apply it both at the high level and at the low level, with the aim of accelerating learning both of the SMDP policy and of the option policies.

One common setting in which you can achieve state abstraction is when you have a factored representation. I don't know if anyone will talk about factored representations in one of these lectures, but the idea is that in most realistic reinforcement learning tasks the state is not a black box: the state consists of a set of values drawn from some set of state variables, so you have d state variables, or d features if you want. The key idea is that some of these state variables might be sufficiently independent; if state variables are independent, their value at the next state does not depend on some of the other state variables at the previous state. In particular, if I choose a subset of the state variables, that induces an abstract space which is the cross product of only the chosen variables, and it's extremely easy to project a state onto the smaller abstract state: I just ignore the state variables I left out. The limitation is that you have to know what these conditional independences are, which is not easy to learn from experience.
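As a toy illustration of the factored idea (entirely my own example, with invented variable names): if a subtask depends only on a subset of the state variables, projection is just dropping the irrelevant ones.

```python
from typing import Dict, Tuple

def project(state: Dict[str, int], relevant: Tuple[str, ...]) -> Tuple[int, ...]:
    """Project a factored state onto the variables relevant to a subtask.
    This is only safe under the conditional independence assumption above:
    the kept variables must evolve independently of the dropped ones."""
    return tuple(state[v] for v in relevant)

# For a pure navigation subtask, only the agent's position might matter:
state = {"x": 3, "y": 1, "has_item": 0, "destination": 2}
abstract_state = project(state, ("x", "y"))   # -> (3, 1)
```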
So Dietterich identified five different types of what he calls safe state abstraction: forms of abstraction that you can apply specifically in hierarchical reinforcement learning and that do not affect the optimality of the policy. The names he gave them are not so important, and can be a bit difficult to understand, but the main idea is to use this factored representation: if I solve a subtask and I know that the subtask only depends on a few of the state variables, I can ignore all the state variables that are irrelevant for solving it. There are some other cases, like result distribution irrelevance, which says that if I take an option in two different states and I have exactly the same probability of reaching each terminal state, then I don't need to keep separate value functions for those two states; I can treat them as essentially the same state with respect to the policy at this level. Or, as the last case says, if I know that an option will never be applied in a certain state, then I don't have to keep an action value for that state-option combination.

So, remarkably, there are very few theoretical results, and I think this is a very big opportunity to do theoretical research in hierarchical reinforcement learning. There are a few convergence results, proven actually by the original authors, but all of them assume that the option policies are given; as I said before, I don't know of a single result that proves convergence when you're learning at both levels at the same time. There is a proof for the average-reward setting from a couple of years ago, and a few authors have also looked at sample complexity and regret for hierarchical reinforcement learning. I'm not going to have time to go into the details here. I actually just found a recent result on regret, from this year, which I haven't read in detail; all the previous works again assumed that the option policies were given, which is a limitation. The paper from this year does learn at both levels of abstraction, but it learns the option policies first, completely, and only then learns the SMDP policy, so it does not learn both at the same time.

Okay, I think I'm doing okay on time. So far, the theory assumes that at least the option MDPs are known: I know what the subtasks are that I want to solve. But what if the agent starts out without any knowledge of what the subtasks should be? In that case, what we can try to do is discover subtasks, either from experience or from some problem structure that we're given. As we discussed, a subtask is encoded in the termination function of an option, and partially also in the reward function that tells you which termination states are good and which are not. The advantage of subtask discovery is that even if I'm given just an MDP and nothing else, I can still try to take advantage of hierarchical decomposition by finding subtasks myself. The drawback, of course, is that I'm making the learning problem a whole lot harder, because finding good subtasks is a difficult problem on its own. So, intuitively, I would say it's probably only worth the effort of learning a subtask structure if I'm going to solve multiple tasks in the same environment; if I'm only asked to solve the MDP once, the effort of discovering subtasks is probably going to be comparable to solving the original task directly.

There were many early approaches to subtask discovery; this was actually also part of my PhD thesis many years ago. All of them have the limitation that they were mainly for the tabular setting. Quickly, some of these ideas. Some authors looked at what are known as bottleneck states, or landmarks: if the agent almost always has to pass through a state to achieve high reward, that state becomes a good candidate for a subtask, so you try to find such states and introduce subtasks that take the agent to them. My dissertation was on analyzing factored structure: similar to what I showed you before, if I have this structure
of conditional independence between state variables, I can use it to identify useful subtasks. Some authors built a state graph: if my state space is small enough, I can build a graph and analyze it in some way. I think Menache et al. tried to find a min-cut, dividing the graph into two parts using as few edges as possible; those partitions then become the subtasks. Policy fragments means that if I solve many tasks in an environment and many of the policies happen to agree in some of the states, I can turn those shared policy fragments into subtasks. Detecting novel states is similar to bottleneck states: while I explore, I revisit states a lot, but then I happen upon a new region of the state space with many states I've never seen before; probably these novel states represent parts I have to go through to reach high-reward areas. People tried clustering, which again requires representing the state space explicitly. And skill chaining means you start from some target region that you know you have to reach, and you learn options that reach those target states; then you learn other options that reach the initiation sets of those options, and so on, chaining skills backwards until you can reach the target area from far away.

Okay, so on to more recent work. There's a method called the option-critic architecture, which was proposed six years ago. Option-critic is inspired by actor-critic, which you haven't seen yet; I've been told you will see it later today, but it's not essential for understanding the main idea here. The main idea is to formulate an objective function for SMDPs with a fixed number K of options, and to derive a gradient not only with respect to the SMDP policy but also with respect to the policies and termination functions of the individual options; then we can optimize the parameters of the option policies and termination functions by gradient descent. The architecture looks something like this: at the bottom is the environment; the agent sends an action and gets back a state and a reward. Very informally, the critic maintains a value function that criticizes how well the actor is doing, and the actor, in this case the options, is responsible for selecting actions. The reward gets passed to the critic; the critic actually maintains value functions both for the SMDP and for the individual options, and gradients from those critics are used to update the option policies and termination functions, and also the SMDP policy, which is the one choosing among options.

So the main contribution here was to learn the subtask structure, both the termination functions and the policies of the options, with something like deep learning: because everything is optimized by stochastic gradient descent, you can learn all of these simultaneously, the option policies and the SMDP policy. But it has several limitations. One limitation is that, left on its own, it tends to discover that the optimal thing to do is to apply primitive actions, which is natural, because we know that with primitive actions we can always represent the optimal policy. So the authors have to specifically prevent options from terminating after one step in order to get more meaningful subtasks. It also tends to create option regions that
are not strongly connected. Intuitively, we would like an option to act in a strongly connected region, where the subtask is to reach some target within that region; because the learned regions are not strongly connected, it's difficult for a human to interpret what the options actually do. And my personal reflection is that in almost all cases of hierarchical reinforcement learning, it's sufficient to have a termination condition that is either 0 or 1: either a state is a terminal state or it's not. Having intermediate values is precisely what allows them to do gradient descent here, because it's much easier to compute gradients for a continuous quantity, but it's also the reason why the result is so expressive, terminating with some probability in every state, that the discovered subtask structure is potentially not that clearly defined.

Another recent work is something called eigenoptions, proposed by Machado and co-authors. The idea is to exploit the successor representation. The successor representation of a policy basically measures occupancies of future states given a starting state: the successor representation, from a state s, of a given state s′, is the proportion of time I will spend in state s′ when I start from s and follow the policy. If I'm discounting, I discount these occupancies appropriately according to the time step at which I occupy s′. Previous work had shown that the eigenvectors of the successor representation give us something called proto-value functions, and the idea of eigenoptions is to define options whose reward functions are such proto-value functions. Let me illustrate this on the next slide.

Here's a simple grid world again, the four-room environment. In this case I think they just use a random policy: I explore the environment with a random policy, I learn the successor representation, and then I compute its eigenvectors. These images show the first three eigenvectors. You can see the first eigenvector prefers being in the top right room and avoids the bottom left room; the second likes the top left room and avoids the bottom right room; and so on. Each eigenvector has a well-defined maximum and minimum: the eigenvector has one value per state, and some state attains the maximum value. So for each eigenvector we can introduce two options, one that tries to reach the state with maximum value, which we define as a termination state, and a second one that tries to reach the state with minimum value. Another benefit is that we don't just get reward upon reaching the terminal state: because every state has a proto-value, we get an effect of reward shaping, so the agent gets some help in knowing which direction to go to reach the termination state. All of this is so far for the tabular setting as well, which doesn't scale that well; but instead of the successor representation you can use successor features, which do essentially the same thing in a given feature space, so you get something that scales better than having to act on a representation where you treat each state separately.
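A minimal tabular sketch of this pipeline (my own illustration): for a fixed policy, the discounted successor representation has the closed form Ψ = (I − γ P_π)^{−1}, and its eigenvectors supply candidate subgoal states.

```python
import numpy as np

def successor_representation(P_pi: np.ndarray, gamma: float) -> np.ndarray:
    """Tabular SR of a fixed policy: Psi = (I - gamma * P_pi)^(-1), where
    P_pi[s, s'] is the state-to-state transition matrix under the policy and
    Psi[s, s'] is the expected discounted number of visits to s' from s."""
    n = P_pi.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * P_pi)

def eigenoption_subgoals(P_pi: np.ndarray, gamma: float, k: int):
    """For each of the top-k eigenvectors, return the (argmax, argmin) states:
    one option per eigenvector climbs toward its maximum-value state, and a
    second one toward its minimum-value state."""
    psi = successor_representation(P_pi, gamma)
    vals, vecs = np.linalg.eig(psi)          # psi need not be symmetric,
    order = np.argsort(-np.abs(vals.real))   # so we take real parts
    return [(int(np.argmax(vecs[:, i].real)), int(np.argmin(vecs[:, i].real)))
            for i in order[:k]]
```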
So again, the main contribution is to simultaneously identify the sub-task structure: what the termination states should be and what the option reward functions should be. The limitations are that, first of all, successor features are specific to a given policy, so if I improve my policy and I want the successor features for that policy, I have to compute the successor features again; and successor features are essentially as expensive to compute as a value function, because you can state a Bellman equation for these successor features, so you have to do a lot of work to learn them. Another limitation is that this ignores reward: it just discovers the dynamics of the problem, and potentially the state I'm interested in reaching might not correspond to the termination states of the options that I get when I do this.

A related idea is something called covering options, which is also based on the eigenvector with smallest eigenvalue of the successor representation. These covering options actually only move from one single state to another single state, and intuitively you can think of them as trying to reduce the horizon, the diameter, of the problem as much as possible. The diameter of a problem is the longest distance between any pair of states, and I'd like to introduce options that reduce the diameter, because the diameter also impacts how fast we can learn a problem. So we try to introduce options that reduce the distance between states as much as possible, and they might look something like this. The limitation here is that each covering option is only applicable in two states: either I go one way or I go the other way. So they have very, very small initiation sets.

Finally, there is this formalism called reward machines, which was proposed by Toro Icarte and colleagues from Toronto, where the idea is to describe the reward using a finite-state automaton over some set of high-level events. In this example, the agent is the triangle, and the agent can interact with these objects; each of these objects is considered a high-level event, and on the right is a reward machine that tells you what you have to do in order to get reward. In this case, one object is a sugar cane and the other is a rabbit: you have to take a sugar cane, bring it to the white table, get a rabbit, bring it to the white table, and then go to the black table. The idea is that you need these resources to produce something, and you can get the sugar cane and the rabbit in either order: either the sugar cane first or the rabbit first. The relation to hierarchical reinforcement learning is that each of the edges of the reward machine can be thought of as a sub-task: getting the rabbit means I have to navigate to trigger the rabbit event, which happens when I step on it. So if I have a reward machine description of a problem, I can solve it using hierarchical reinforcement learning by making local policies that achieve these sub-tasks. And many authors have shown how to practically learn reward machines from traces of high-level events, so this is also a form of sub-task discovery: I interact with my environment, I see which sequences of events trigger reward, and then I build this reward machine, and that provides a sub-task structure for doing hierarchical reinforcement learning.
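A reward machine can be sketched as a small lookup table over (machine state, event) pairs. The machine states and event names below are invented to loosely mirror the sugar-cane-and-rabbit example; they are not the original paper's encoding.

```python
# Sketch of a reward machine as a finite-state automaton over high-level
# events (hypothetical encoding of the sugar-cane/rabbit example).
REWARD_MACHINE = {
    # (machine state, event) -> (next machine state, reward)
    ("u0", "sugar_at_white_table"):  ("u1", 0.0),
    ("u0", "rabbit_at_white_table"): ("u2", 0.0),  # either order works
    ("u1", "rabbit_at_white_table"): ("u3", 0.0),
    ("u2", "sugar_at_white_table"):  ("u3", 0.0),
    ("u3", "reached_black_table"):   ("u_done", 1.0),
}

def rm_step(u, event):
    """Advance the machine on one event; irrelevant events leave it unchanged."""
    return REWARD_MACHINE.get((u, event), (u, 0.0))

# Each edge is a sub-task: learn a low-level policy that triggers that event,
# and let a high-level policy traverse the machine from u0 to u_done.
u, r = rm_step("u0", "rabbit_at_white_table")      # -> ("u2", 0.0)
```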
So, I'm almost out of time, so I'll try to go quickly through the last part, which is about transfer learning. The intuition is that because in hierarchical reinforcement learning we have a lot of sub-tasks, there are a lot of opportunities to transfer knowledge from one to the other. I would say the simplest form of transfer learning, which people proposed already back in the 90s when they started applying hierarchical reinforcement learning, is when I have two options that act on the same state space. Here are two options that act in the top left room in our example: one has to reach the right doorway and the other has to reach the bottom doorway. If I sample a transition using one of the options, then this transition could also have been sampled using the second option, so I can use it to update the policy of the second option as well. This is called intra-option learning, and it tends to speed up learning of the option policies, because you reuse the experience accumulated by one option to learn about another option's policy.

What people have studied more recently is something called goal-conditioned policies, where the idea is simply to encode the goals. I might have a set of states that are possible goals for the agent, or in hierarchical reinforcement learning you can think of these as possible termination states that I might want to reach, like the right doorway or the bottom doorway in the case of the doorways. Then I can encode the goal as part of the state and try to learn a policy from state-goal pairs to actions: what action should I take given that my state is this and my goal right now is this? In the tabular setting this is equivalent to learning separate policies, because I have to learn a separate policy for each individual goal state, but if I'm doing function approximation, I might be able to learn a goal-conditioned policy more efficiently than having to learn a separate policy for each of these goals. Many authors have studied this in the context of hierarchical reinforcement learning, and the limitation is of course that it's harder to learn a policy that can reach a number of goals, because I make the problem harder.
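Here is a minimal sketch of what goal-conditioning looks like in the tabular case, assuming a made-up grid, a few hypothetical doorway goal states, and a 0/1 reward on reaching the goal; none of these specifics come from the lecture.

```python
import numpy as np

# Minimal tabular sketch of a goal-conditioned Q-function: one table indexed
# by (state, goal, action) rather than a separate table per goal.
n_states, n_actions = 25, 4
alpha, gamma = 0.1, 0.99
goal_states = [7, 11, 17, 23]                        # e.g. doorway cells (made up)
Q = np.zeros((n_states, len(goal_states), n_actions))

def update(s, g, a, s_next):
    """One Q-learning step for the policy conditioned on goal index g."""
    reached = (s_next == goal_states[g])             # goals double as terminal states
    r = 1.0 if reached else 0.0
    target = r + (0.0 if reached else gamma * Q[s_next, g].max())
    Q[s, g, a] += alpha * (target - Q[s, g, a])

# In the tabular case this is just len(goal_states) separate policies; with
# function approximation, a network taking (s, g) as input can share structure
# and generalize across goals.
update(s=0, g=1, a=2, s_next=1)
```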
Something that I think is a promising direction for hierarchical reinforcement learning is to make sub-tasks that only act on a subset of states; we want sub-tasks to be easier to solve. One way to achieve such options is to consider partitions of the state space. If we partition the states according to the rooms in this example, then each option only acts on a smaller part of the state space, and at the top level, under some conditions, the SMDP policy doesn't need to care about where in a room I am. So I can abstract at the high level and say the only states of the SMDP are going to be which room I am in, and which option I take in each room. The benefit is that I achieve simplification both at the SMDP level and at the option level, so I really made the problem easier. Of course, this also means I would like some properties of the partition: I would like the partitions to be more or less the same size everywhere, and learning a partition is not particularly easy, but there have been some examples of researchers trying to learn partitions from experience. A related idea, which I didn't put here, is to again identify some landmark states: I might put some landmarks in my environment and have options for reaching those landmarks, where each option only acts close to its landmark. I don't care about reaching the landmark from far away, so I have to kind of step through the landmarks to reach some faraway point. There are several works along that line as well, which achieve a similar effect: each option acts on a smaller part of the state space.

Another idea that was proposed recently is to exploit something called compositionality. If I already have a set of reward functions and I've learned policies for those reward functions, and I'm given a new reward function which can be expressed as a weighted sum of the existing reward functions, then I can exploit several of the ideas I mentioned, successor features and something called generalized policy evaluation, to estimate a policy for the new reward function without learning. This is not going to be an exact or optimal policy for the new reward function, but it will be a good approximation, and people used the option keyboard as an application of this idea to hierarchical reinforcement learning.
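Here is a minimal sketch of that composition, assuming we already have successor features for a few base policies; the dimensions and the random `psi` tables are placeholders standing in for learned quantities, and the max over base policies plays the role of generalized policy improvement.

```python
import numpy as np

# Sketch of composing known solutions for a new reward that is a weighted sum
# of base rewards, via successor features (placeholder data throughout).
n_states, n_actions, n_features = 25, 4, 3
rng = np.random.default_rng(0)

# psi[i][s, a, :] = successor features of base policy i from (s, a):
# the expected discounted sum of feature vectors phi along its trajectories.
psi = [rng.random((n_states, n_actions, n_features)) for _ in range(4)]

def gpi_action(s, w):
    """Greedy action for the new task r = phi . w, without further learning."""
    # Evaluate every base policy on the new reward weights (generalized
    # policy evaluation), then take the best action over all of them.
    q = np.stack([p[s] @ w for p in psi])    # shape (n_policies, n_actions)
    return int(np.argmax(q.max(axis=0)))

w_new = np.array([0.5, 0.0, 0.5])            # new task as a mix of base rewards
action = gpi_action(3, w_new)
```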
Okay, yes, a question about partial models. What do you mean by partial models? So, when you have models that predict features of the state conditioned on some features of the state. Okay, well, that could be used in the settings I've described; it could probably also be combined with successor features in that way. With reward machines you can also do transfer, to come back to this again: if there are two edges in the reward machine that require me to go get the rabbit, of course I can reuse the policy from one for the other, so I can do transfer learning in reward machines as well. Something else you can exploit is when some parts of my state space have exactly the same dynamics: the column of the three middle rooms here have exactly the same dynamics, so if I learn a policy for reaching the right door, I can reuse it in the other rooms as well. Of course, in this example, having rooms of exactly the same size is a bit unrealistic, but when you're using factored representations, you can automatically get these equivalence classes if you have this conditional independence, so that is a more realistic scenario in which you can exploit such equivalences.

I'm going to go quickly because I'm out of time, but the last thing I want to mention is one work that I've been doing with a PhD student, which is a similar idea to the option keyboard but for a particular class of MDPs called linearly solvable MDPs, in which the Bellman optimality equation is linear. Because the Bellman optimality equation is linear, I can apply a form of compositionality: if I have a set of boundary states and I learn policies for reaching each of these boundary states, and you now give me another task with arbitrary rewards on the boundary states, then I can achieve an optimal policy for this new problem without learning. For LMDPs, because of the linear Bellman equation, this is optimal, and we've exploited this recently to obtain a formalism of hierarchical reinforcement learning. In this example we have four rooms again, and on the right is an equivalent sub-task, which is equivalent for all of the rooms. We define these five sub-tasks, for exiting the room in each direction and for reaching a given goal location. On the bottom level we solve these five sub-tasks, and then on the top level we only have to learn a value function on the green states; the optimal values of all other states will be determined by these sub-tasks. I'm really out of time, but an interesting feature of this, I think, is that unlike all other forms of hierarchical reinforcement learning, in our work the high-level policy never actually selects an individual sub-task to apply; instead, the value function is composed from the value functions of the sub-tasks, and then the policy acts according to this composed value function. So I think this is an interesting direction to explore.

Well, just to summarize, there are some limitations of hierarchical reinforcement learning. There isn't a lot of unified terminology or common benchmarks, which I think is a big problem: when people write papers, they all work on different problems, so it's hard to compare to previous work. There are no killer applications; the only application I've seen is StarCraft. And there are few theoretical guarantees. But on the other hand, there are many open research questions, so I think there's a lot of opportunity for research in this field. Okay, I will stop there, thank you.

Thank you, Anders.