All right, hello everybody. If you're listening to this lecture, you're either in CIS 419/519 or in CIS 522. If you're in the 519 class, then you know who I am. But if you're in 522, you will have noticed that this is neither Lyle nor Conrad teaching this class. Instead, I will be handling reinforcement learning. My name is Dinesh Jayaraman, and I'm an assistant professor in the CIS department. I work in computer vision, machine learning, and robotics, including applications of reinforcement learning to robotics.

By now you're familiar with several different types of learning settings. In particular, we've seen supervised learning, which is typically the setting where you're given some training data along with desired outputs for that data as labels, and you try to fit a function that produces those desired outputs, hopefully in such a way that it generalizes. We've also seen what you can do with unsupervised learning, where you're given just the training data without any labels. And we briefly discussed semi-supervised learning, where some of your training data has labels and the rest doesn't.

Reinforcement learning is quite special. In reinforcement learning, you no longer have a dataset that's given to you in advance. Instead, you receive your data as the agent performs sequential actions in an environment. That process of performing sequential actions generates observations and is accompanied by reward. So there is no label per se, but the reward essentially tells you whether the actions you performed were good or not. I don't expect that things are very clear at this point, but we will digest this over the next few slides so that you'll hopefully have a clearer understanding of what the reinforcement learning setting is.

Broadly, the aim of reinforcement learning is to make sequential decisions in an environment. For example, if you're driving a car, you're deciding at every instant how to steer, whether to put your foot on the gas or the brake, and so on. Those are decisions you're making sequentially in the environment, and each one affects something in that environment: the position of your car at the next instant, the people around you, the other vehicles on the road. Through those effects, each action also influences your future actions. That's one example, but we make sequential decisions all the time: you might be cooking, playing a video game, controlling a power plant by changing its settings, coming up with a medical treatment plan for a patient, or deciding which online ads to show a user. All of these typically involve some kind of sequential decision making.

The question then becomes: how do we learn to make those sequential decisions? The RL answer is to assume that you have occasional feedback. For example, if you're learning to cook, your feedback comes from whether your meal is tasty or not. If you're learning to drive, your feedback comes from whether you've crashed your car or gotten home safely, or maybe from an occasional honk from nearby drivers on the road.
And if you're learning to play a video game, then the points in the game give you feedback. So RL assumes you have this occasional feedback, and the learning paradigm is then to use that feedback to learn through trial and error: you try out a bunch of things and see what earns the best feedback. But you have to make use of that experience as cleverly as possible, so that even though you're learning through trial and error, you minimize the number of trials you need in order to learn something.

Having described the sequential decision-making setting of reinforcement learning in plain English, let's start formalizing it a little. In particular, we will assume an abstraction of an agent and an environment. The agent is the thing that's going to learn to perform actions, and it operates in a setting that we'll call the environment. For example, a robot agent could be operating on a tabletop where it must grasp objects, or in a kitchen where it should cook; or a self-driving car agent could be operating on a highway. Remember, this is a sequential decision-making problem, and we're trying to accomplish some task in that environment, like cooking a tasty meal or driving safely.

At every time step of this sequential decision-making problem, the agent starts by looking at the state of the environment, and that state informs the actions it will execute. So at every step, the agent observes the state of the environment, which we'll denote s_t, and it might also get feedback from the environment on how well it's been doing; that feedback is called the reward, denoted r_t.

Now, it might not always be easy to observe everything in the environment that is important for making the decision. For example, a pedestrian on the street might be hidden behind a car right now, so you can't observe them well. You might therefore not get to observe the full state of the environment; you might instead observe only some features of that state, some function of it, with some aspects of the state omitted, like the hidden pedestrian in that example.

Also remember that the reward is only occasional: we don't always get a reward from the environment, and there may be time steps t at which there is no reward. But once the agent has observed the state s_t and received any reward r_t, the next step is for it to emit, or execute, an action a_t in the environment. You can think of this as a turn-by-turn abstraction of acting in the world: in one turn the agent executes a particular action, then it gets some feedback and sees how the environment changed in response, and in its next turn it executes another action. So at the next time step it receives an updated state s_{t+1} and a new reward r_{t+1}.
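To make this turn-by-turn loop concrete, here is a minimal sketch of the agent-environment interaction in Python, written against the Gymnasium API (this assumes the gymnasium package and its CartPole environment are available; the random action choice is just a stand-in for the policy we'll define shortly):

```python
# A minimal sketch of the agent-environment loop (Gymnasium API).
import gymnasium as gym

env = gym.make("CartPole-v1")   # the environment
s_t, info = env.reset()         # observe the initial state s_t

done = False
total_reward = 0.0
while not done:
    a_t = env.action_space.sample()    # the agent picks an action a_t (here: at random)
    # the environment responds with the next state s_{t+1} and reward r_{t+1}
    s_t, r_t, terminated, truncated, info = env.step(a_t)
    total_reward += r_t
    done = terminated or truncated

print("Total reward collected this episode:", total_reward)
```

One way or another, RL algorithms all run some variant of this loop; what differs between them is how the action is chosen at each step and how the observed states and rewards are used to improve those choices.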
And remember, we said at the beginning that the agent's goal in doing all of this was to make a tasty meal or to drive safely. We've stated that goal in English, but the way the goal is actually expressed to the agent is through the reward, the feedback. So the agent's goal in reinforcement learning is always just to maximize its expected rewards. The reward is really important, because it's the reward that specifies what constitutes a good execution of a task and what does not.

Typically, we will abstract the process of learning a particular skill in the environment, like cooking or driving, into a policy function: a mapping from a state s, belonging to the set of states S that the environment can be in, to an action a, belonging to the set of actions A that the agent can take. This function representing the optimal mapping from states to actions is, if you will, the output of reinforcement learning: at the end of reinforcement learning, the agent should have learned a good policy function π(s).

All right, so how is reinforcement learning different from what we've seen before? Let's recap the previous couple of slides. First, in reinforcement learning there is no notion of supervision; supervision is replaced by occasional rewards as feedback. Second, RL operates in the sequential decision-making setting, which means the data is generated as sequences: we saw that s_t and then s_{t+1} are generated in sequence. So this is not IID sampling from a distribution; the data is not independent and identically distributed like we've been assuming so far in settings like supervised learning, but is instead generated sequentially. Third, the training data is generated by the learner's own behavior. When you're learning to drive, you don't have a repository of training data waiting for you to learn an optimal policy from. Instead, reinforcement learning typically assumes the data is generated by the learner's own behavior: as you learn to drive the car, you encounter various scenarios, and in that process you learn from your own experience. Those are three main characteristics of RL problems that make them quite different from what we've seen before.

These characteristics in turn lead to two key problems that a lot of the RL literature is devoted to solving. The first is the credit assignment problem: deciding which of the decisions you made, which of the actions you executed, were the good ones and which were the bad ones. Remember, you've executed a whole sequence of actions by the time you've completed a task.
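Before going further into those problems, it may help to pin down the two objects we just introduced, the policy π and the reward-maximization objective, in code. This is a toy sketch only; the states, actions, table entries, and reward values below are invented for illustration:

```python
# A toy illustration of a policy π: S → A as a lookup table.
states = ["light_red", "light_green"]   # the set of states S (hypothetical)
actions = ["brake", "go"]               # the set of actions A (hypothetical)

# One particular deterministic policy: a mapping from each state to an action.
policy = {"light_red": "brake", "light_green": "go"}

def pi(s):
    """π(s): the action this policy chooses in state s."""
    return policy[s]

# The agent's objective is to find the policy with the highest expected total
# reward. Along one trajectory, the total reward is just the sum of the r_t's;
# here the feedback is occasional, arriving only at the last step.
rewards = [0.0, 0.0, 1.0]
total_reward = sum(rewards)
print(pi("light_red"), total_reward)    # -> brake 1.0
```

In practice the policy is usually a learned, parameterized function, for deep RL a neural network, rather than a hand-written table, and the expectation is taken over the randomness in both the environment and the policy.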
Back to credit assignment: if, for example, you've cooked a meal and the meal was not tasty, it's important to figure out which of the decisions you made while cooking were the good ones and which were the bad ones. This is a real problem because you only ever get feedback occasionally, and that feedback might not correspond to something you've done very recently; it might instead be the cascading effect of something you did a few time steps back in the past.

The second key problem, which is related to the third point above, is that because you're generating your own training data, it becomes important to decide what to try. We said that reinforcement learning involves learning through trial and error, but then deciding what to try out matters a great deal. That is often captured as the notion of exploration versus exploitation. We'll return to that and make it clearer as we get deeper into the material (a small preview sketch appears after the examples below).

In the spirit of differentiating RL from the other settings we've seen before, let's also identify which settings are not sequential decision-making settings and therefore would not be appropriate for RL. If you're just making a classification or regression decision, a single isolated decision that doesn't affect any future decisions you might have to make, then you typically aren't operating in a sequential decision-making setting. For example, if you're looking at images on the web and identifying them as images of mites or container ships and so on, you're probably not in a sequential decision-making setting. Likewise, if you're holding up your phone to identify a sign, with an ML algorithm on the phone recognizing and translating the text, you again aren't operating in a sequential decision-making setting, because your only desired outcome is to translate that text well.

If, instead of such a single isolated decision, your setting is one where you don't have supervision and your actions have consequences that extend into the future, producing new observations that require new actions and so on, then you are operating in a sequential decision-making setting. We've already seen example applications like robotics and autonomous driving. Interestingly, you can also use reinforcement learning for language and dialogue: if you're trying to hold a conversation with a human, then the chatbot is the agent and the human is the environment, and the chatbot has to execute actions, which consist of saying the right things, in order to accomplish some task. Or consider business operations, like operating a power plant, or making financial decisions: which stocks to invest in and which to divest from could be your actions, and the stock market could be your environment.

Let's see some more concrete and fun examples of reinforcement learning. Here is an example in the context of video games: this is, of course, the Mario video game, and you can see that this reinforcement learning agent has learned a pretty good policy for playing it.
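How does an agent like this Mario player decide what to try while it is still learning? As a small preview of the exploration-versus-exploitation trade-off mentioned above (the lecture returns to it properly later), here is one classic strategy, epsilon-greedy action selection. The value estimates in Q below are invented for illustration:

```python
import random

# Epsilon-greedy: with probability epsilon, explore a random action;
# otherwise, exploit the action currently believed to be best.
# Q holds the agent's (hypothetical) current estimate of each action's value.
Q = {"brake": 0.2, "go": 0.7}

def epsilon_greedy(Q, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(list(Q))   # explore: try an action at random
    return max(Q, key=Q.get)            # exploit: best current estimate

print(epsilon_greedy(Q))  # usually "go", occasionally "brake"
```

With probability epsilon the agent gathers information about actions it is less sure of; the rest of the time it exploits its current best guess. Balancing these two is a large part of what makes trial-and-error learning sample-efficient.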
Here's an example of an application with a more real-world flavor, where a bunch of robot arms are learning to grasp objects from a cluttered tabletop. This result, from about five years ago now, was very exciting for reinforcement learning in robotics.

And here's another example, again in a robotic setting but this time in simulation, of agents learning to navigate various obstacle courses. I like to call this robot parkour, and you can see that these humanoid-style agents are learning to traverse some impressively difficult obstacle courses. In these cases, the agents are controlling the positions and velocities of all their joints at every instant: the action must control the angular velocities of those joints. Here's another example with a more complex robot, this time a spider, learning to do the same thing. And finally, this funny-looking humanoid robot operating on an obstacle course has learned a rather weird gait, but it works: it's even robust to random forces being applied to it, which is what those red bars represent.