All right, having seen both model-free and model-based reinforcement learning methods, let's turn our attention to another kind of machine learning approach for sequential decision-making, called imitation learning. Here is a particularly cool example of how humans learn through imitation: this baby imitates the Rocky practice montage, and I find it really impressive that this toddler is actually able to perform all of the actions in the montage. There are also papers written on how infants are able to learn from televised models.

Now, imitation learning, like reinforcement learning, is more general than just robotics, but it's worthwhile to look at how imitations are provided in the robotics setting. For example, you could teleoperate a robot using controllers; you could use kinesthetic teaching, which means you move the robot's joints by hand to show it how to move; you could use motion capture to record a human's motions and then translate them onto the robot; or you could even use web video and learn to imitate from third-person video of a person moving.

All right, let's look at a technical approach for imitation learning, and the approach we'll begin with is the simplest one. Let's start from the policy gradient expression for model-free reinforcement learning that we've seen a few times now. Once more, let's parse it to remind ourselves. The first couple of terms have to do with summing over trajectories and summing over time steps within the trajectories. The third term is the likelihood gradient, which says: change the policy to make these actions more likely. The fourth term, in the policy gradient case, measured how good this trajectory, or the rest of this trajectory, was.

Now, suppose you had access to demonstrations from experts, the trajectories you were summing over were those demonstrations, and you knew that the experts were executing the optimal policy. Think about what you might change here. Do you still need a critic to tell you how good this trajectory was? You're told in advance that this is a trajectory performed by an expert, so why keep the critic around? Here is one suggestion for how to change this: if you have demonstration data, and these are the expert actions you're trying to make more likely, then you can completely drop that fourth term, because expert actions are, after all, optimal, and all equally optimal. That's the assumption. At that point, all you're really saying is that you're going to maximize the likelihood of the action given the input state at every time step of every expert trajectory. This should begin to look a lot like maximum likelihood training, because that's exactly what it is. All you're doing is maximizing the likelihood of the expert actions given the states that were input to the expert. So you're simply treating this as a supervised learning problem: take the expert trajectories, with the actions the experts performed at the states they encountered, and use those (s, a) pairs as your (x, y) pairs in supervised learning. You're really just doing maximum likelihood supervised learning of this probabilistic model.
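To make that concrete, here is the derivation written out; the advantage symbol \hat{A} and the data set symbol \mathcal{D} are just my shorthand for the fourth term and the demonstration data, not notation fixed by the lecture.

```latex
% Policy gradient: sum over trajectories i and time steps t;
% the score function says "make this action more likely", and
% \hat{A}_{i,t} measures how good the rest of the trajectory was:
\nabla_\theta J(\theta) \approx \sum_i \sum_t
    \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, \hat{A}_{i,t}

% If the trajectories are expert demonstrations assumed to be
% optimal (and all equally so), set \hat{A}_{i,t} = 1. The update
% then simply ascends the maximum-likelihood objective on (s, a) pairs:
\theta^\star = \arg\max_\theta \sum_{(s,a) \in \mathcal{D}}
    \log \pi_\theta(a \mid s)
```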
The policy need not necessarily be probabilistic, but yes, you're doing maximum likelihood training here to map from expert sensory inputs to expert actions. So you have this data set of (s, a) pairs, where the s are the states encountered by the expert and the a are the actions executed by the expert, and you're going to do supervised learning on that. This approach is called behavioral cloning.

So let's look at what that approach would look like in practice. You would have an expert collect a bunch of data, for example by driving a car. You would take the states observed by the expert and the actions executed by the expert, make them your training data set, and run supervised learning on it. That's how you would train your policy pi_theta(a_t | s_t). Typically, you might use a policy that looks like a convolutional network: if your input states are images, you might use a convolutional network whose output is an action a_t, which might be steer left, steer right, or drive straight, or the entire driving action, involving a steering command, a throttle, a brake, and so on.
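Here is a minimal sketch of that supervised training step in PyTorch. The image size, the three discrete steering actions, and the expert_states / expert_actions tensors are illustrative assumptions on my part, not the exact setup from the lecture.

```python
import torch
import torch.nn as nn

# Hypothetical setup: 3x64x64 camera images, 3 discrete steering
# actions (left, straight, right). Real systems often regress
# continuous steering/throttle/brake commands instead.
class ConvPolicy(nn.Module):
    def __init__(self, num_actions=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 13 * 13, 128), nn.ReLU(),
            nn.Linear(128, num_actions),  # logits over actions
        )

    def forward(self, s):
        return self.net(s)

policy = ConvPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

# expert_states: (N, 3, 64, 64) images; expert_actions: (N,) indices.
# Maximizing log pi(a|s) over expert (s, a) pairs is exactly
# minimizing the cross-entropy between the policy and the labels.
def bc_update(expert_states, expert_actions):
    logits = policy(expert_states)
    loss = nn.functional.cross_entropy(logits, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```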
Okay, so one issue with this really simple scheme, which, by the way, is super easy to implement and surprisingly good in practice, very often very good: a common problem you run into with behavioral cloning is distributional shift. The issue is essentially that, as we've seen, once you have a large enough data set and a high-dimensional input space, it becomes really hard to train a perfect model, a model with completely zero error that generalizes well. So you should expect your model to be off by at least a little bit; even a really good model is very unlikely to be perfect.

And then what happens is this. Say your training trajectory was this black trajectory, and that's what you're trying to imitate, and your pi_theta produces the red trajectory. It produces almost exactly the same trajectory at the beginning, but over the course of time the small errors in the policy relative to the expert keep compounding, until you end up in a completely different part of the input space. You could imagine this in a navigation setting where the expert policy drove right down the center of the road, and the policy trained with behavioral cloning drove near the center of the road at the beginning but kept drifting off, and drifting off, until it was completely off the road, just through this kind of compounding error over time. So the cloned policy is imperfect, and this leads to accumulating errors. Importantly, once you're off the trajectories you trained on (remember, the black trajectories were the ones you trained on), you're encountering states you've never seen before, and at that point things are only going to get worse. You might suddenly be at a point where the states you're encountering are really far from anything in your training data, and things go south very fast.

So a solution to this that's very general and very widely used to fix behavioral cloning is DAgger, for Dataset Aggregation. In DAgger, the idea is that it's okay to encounter unfamiliar states when you first execute your policy; what you do then is go back to your expert and ask for new labels. For example, if a human driving the car is your expert, you ask the human once more to tell you what you should have done in the new states you encountered. That gives you a new data set, which you add to your imitation learning data set, and then you retrain your policy and keep repeating this over and over.

To put that a little more precisely: first train the policy on the original demonstration data D; then run the trained pi_theta to collect a new data set of observations. Those observations are produced as a function of the actions the policy executed, so they're almost surely from a different distribution than the data you originally trained on, because, remember, you're going to go off the road. You then ask the human to label that data set with the actions that would have been optimal in those states. So you learn, for example, that if you're veering off the road, you should correct to come back to the center of the road; that will be the action the human provides. You aggregate that into your data set, so that the next time you train a policy and run it, those same states where you drift away from the center of the road are no longer out of distribution, and you don't fail quite as spectacularly. The few failures you have left get relabeled by the human, and you repeat this process over and over. Usually, within a handful of iterations, you have converged to a policy that performs fairly well. So this behavioral cloning plus DAgger approach is quite widely used.
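Here is a compact sketch of that loop; train_policy, rollout, and expert_label are hypothetical helpers standing in for supervised learning on (s, a) pairs, executing the current policy in the environment, and querying the human, respectively.

```python
# DAgger sketch under the assumptions named above.
def dagger(expert_demos, num_iterations=5):
    dataset = list(expert_demos)             # D_0: original (s, a) demos
    policy = train_policy(dataset)           # behavioral cloning on D_0
    for _ in range(num_iterations):
        visited_states = rollout(policy)     # states the *policy* reaches,
                                             # including off-distribution ones
        new_pairs = [(s, expert_label(s))    # expert labels what it would
                     for s in visited_states]  # have done in each state
        dataset.extend(new_pairs)            # aggregate: D <- D U D_new
        policy = train_policy(dataset)       # retrain on the aggregate
    return policy
```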
All right. So we've seen one class of approaches for imitation called behavioral cloning, which essentially reduces imitation to supervised learning on the expert (s, a) tuples, plus some little tricks like DAgger. There are several other kinds of imitation approaches as well. Maybe the second most important category is inverse reinforcement learning, where you run reinforcement learning on a reward function that is inferred from human demonstrations: you explicitly use the human demonstrations to identify a reward function, and once you have that reward function, all that's left to do is run reinforcement learning. And there are other approaches that don't neatly fit into either of these categories. In particular, one approach that has become quite popular recently is adversarial imitation, and an important method in that class is GAIL, for Generative Adversarial Imitation Learning. There are also other approaches like model-based imitation, and so on.

All right, before wrapping up, let me show you some examples of the successes of imitation learning. This is an example of the NVIDIA AI car, which was trained with behavioral cloning style approaches back in 2016 and managed to do fairly well at things like navigating around traffic cones and relatively traffic-free driving. Here is another example, from back in 2008, from Stanford, where they managed to train a helicopter to perform fairly impressive tricks using inverse reinforcement learning approaches. For example, you could teach a helicopter to fly upside down, and all kinds of other stunts, from demonstrations by a human expert. This was really impressive all the way back in 2008. And here is an example of third-person imitation, where you can take videos like this and get a simulated humanoid robot to perform those same tricks. You can learn all these impressive acrobatic tricks by looking at human demonstrations of them: third-person demonstrations, with a different embodiment, and so on. In this case, you would effectively recognize the human pose using computer vision methods and then try to imitate that same pose with the robot, and use that as an approach for imitation.