I believe Ann is going to be helping me with that. Okay, great. All right, maybe we should get started; it's about one minute past, which is normally a good time to start. Hello everyone, my name is Dinesh, and I am filling in for Khandrad, who I hear is having a nice vacation in the Caribbean. So my talk today is going to be, oops, that's not good. Ann, let me know if I need to change something on the audio settings on my end. Dinesh, I just, yeah, there we go. All right. So as I was saying, I hear that most of the learning you do in this class, in terms of the actual pedagogical material, is done through these interactive workbooks, which by the way is extremely cool, but it also means that I'm not quite sure exactly what type of lecture to deliver. So I'm doing a somewhat experimental lecture format: by talking about some of the work that I do, I'm going to use that as a vehicle to communicate some broader ideas and lessons about deep machine learning.

In particular, my talk, as you might see, is called How to Make Good Decisions via Supervised Deep Learning, and primarily what I mean by that is making good sequential decisions. So the problem setting we'll be dealing with throughout today is sequential decision-making. This is not a problem setting you've encountered yet, I hear, because we haven't yet gotten to deep RL, which is typically where you first talk about sequential decision-making. But the approaches I'll cover today are based on approaches you've already learned. So we won't be doing what you typically think of as reinforcement learning in today's class; we'll instead be talking about how you can take methods designed for supervised learning and apply them to the sequential decision-making setting.

But first, what is sequential decision-making? Really, sequential decision-making is about making a sequence of decisions to maximize some success measure or reward. You might imagine that this is the same problem you're solving if you are a dog trying to get a nice treat, or if you're a robot trying to learn to run really quickly, or if you are a manager in a company trying to figure out what to stock at which place within your pipeline to maximize profits. In each of these cases, you can think of the problem as making a sequence of decisions over time to maximize some objective. Let's make that a little more concrete.

Before that, a note on the rewards: what's different about this setting from supervised learning is that, unlike a loss function, which depends only on the current sample and the current classification or regression decision, the rewards here are not specific to every single decision that you make. The thing you're trying to maximize is not a function of a single decision; it's a function of the entire sequence of decisions. Does that make sense? So for example, if I'm trying to learn to run fast, I have to make a bunch of decisions in the process of running fast.
Some of those decisions might involve how to move my hands, some might involve how to move my legs. You have to make that sequence of decisions at every instant of time, and only at the end, at the end of the race, do you know whether you've actually managed to win first place or whatever. So you only know whether you've made good decisions at the end of the entire sequence. You might get soft signals at some intermediate stages, but the final feedback about whether you've done well really only comes at the end. That's why it's different from the supervised learning setting. So let's imagine now the case of a dog that's trying to get a treat.

Yes, yeah, that's exactly the problem, right? It's not easy to make good decisions, or even to learn to make good decisions, because you don't know whether any instantaneous decision you're making is good or not. You only get feedback at the very end, after maybe a hundred steps, about whether you did well overall. And that feedback applies not to any individual decision you made along the way; it applies to the entire sequence of decisions. So you get feedback for an entire sequence of a hundred decisions about whether it was good or not, and now you have to somehow figure out which of those decisions were actually good, which were bad, and so on. Right, so that's the, yes, that's right. That is right. So you can imagine, for example, that you're learning to ride a bicycle: you will try a bunch of times, you will try a bunch of things, you will move your body weight this way and that way, you will pedal differently each time. Sometimes you will succeed at staying up for 10 seconds, sometimes you will fall right away. You have to somehow use those signals to eventually learn a good policy for riding a bicycle.

Okay, good question, yes. So the question was, can you do something like backpropagation here for credit assignment? Because you can imagine saying that I have this entire sequence of computations. You've looked at how you can think about computation in a deep neural network as a computational graph that sequences operations one after the other, so you can imagine that this entire sequence of decisions you're making is a computational graph. One step in that computational graph, though, is actually performed by the environment. Not all of it is computation that you're performing: you're doing the computation that tells you what decision to make at every step, but the environment is also part of this computational graph, because the environment takes the action you emitted and then transitions into a new state. It's going to tell you that you fell, or it's going to tell you that you're now upright on your bicycle. That part of the computation graph is not available to you: you do not know in advance what the environment is going to do in response to your actions, and that is why it's not possible to propagate gradients back through it. Does that make sense? Good question. Okay.
So let's now look more concretely at what the instantiation of decision-making is in each of these three cartoon examples at the bottom of the slide. For example, take a dog that's trying to get a treat from its owner. By the way, reinforcement learning actually originally came from the neuroscience and animal cognitive science setting; for a long time it was thought of, and Conrad knows more about this stuff than I do, but it originally originated as a method for explaining animal learning. So you can think of a dog that's trying to earn rewards, earn a treat, from its owner. Its actions are the various muscle contractions all over its body. It's trying to learn to perform some new trick. It doesn't know that the owner is trying to teach it a new trick at all; all it knows is that it wants to earn the reward from its owner. That's all it knows. And its observations include sight and smell: it can smell that the food is being held up at some height, perhaps, and it can see the owner and whether they're happy or not. It can see its own state, it can see objects around it, maybe it sees a hoop that it must jump through, et cetera. And its success measure, or reward function if you want to call it that, is whether it gets the food in the end. That's the only signal it has for learning to perform the trick its owner wants to teach. Does that make sense? Okay.

Here's another example. For this robot, its actions are going to be the motor currents or torques that it can control throughout its entire body, and its inputs might then be camera images. Its final success measure is: is it running fast or not? This is a robot that's trying to run fast, apparently. And here is the case of a manager at a company who's trying to maximize the bottom line. They have to make decisions about what to purchase and what to stock in various portions of the inventory management pipeline. Their observations, what they're making those decisions based on, are the current inventory levels, and maybe also other things like the state of the market and so on. And the final measure of success they're trying to optimize is: are their decisions yielding profit or not? Any questions about this basic setting?

So these are all examples of reinforcement learning or sequential decision-making settings. Again, I'm trying to avoid calling it reinforcement learning today because we aren't really going to be using what we traditionally think of as reinforcement learning methods; we'll instead be using supervised learning to solve these same types of problems. Okay. So that is the sequential decision-making setting. And hopefully, actually, I think those two questions did a great job of fleshing out why this is very different from the supervised learning setting. It's different because, A, we don't have access to instantaneous feedback for any individual decision that we make, and B, you can't even think of this as backpropagation through a full computational graph, because some of the computation is being performed by the environment.
Some of the computation is being performed by the environment, because the environment tells you: earlier you were standing upright, then you leaned towards your right, and that means you're going to fall next. You don't have access to how the environment is making that decision for you. You don't know the physics of the environment, perhaps, or you don't know how the market is going to behave, for example. These dynamics are not known to you in advance. Yes. Well, this setting is the sequential decision-making setting; reinforcement learning is a class of methods that tackle sequential decision-making. Today we will not be talking about reinforcement learning approaches for solving this problem; we'll be talking about supervised learning approaches, and the reason that's not straightforward is all those reasons I just described. Reinforcement learning is an algorithm class; sequential decision-making is a problem class. Reinforcement learning is a commonly applied algorithm class for solving sequential decision-making problems. So we'll see a couple of examples today. All right, good. And by the way, I have no idea how long this is going to take me, and really the objective is not to get through it all so much as to give you a flavor of this new problem class, and also to use it as a vehicle, like I said, to communicate some broader issues about machine learning, regardless of whether you're interested in sequential decision-making or not.

OK, all right. So let's try to formalize, just a little bit, this idea of having to make sequential decisions. You can imagine a loop of interaction between an agent and an environment, where the agent at every time step makes a decision about an action a_t that it needs to perform. That a_t is going to be a function pi, traditionally called the policy, of the input state s_t. That input state came from the environment: the agent observes the environment, which here is represented as the entire world, and deduces that the state is s_t. This could be, for example, a camera image of the environment. The subscript t corresponds to the time instant. So at any time instant t, the agent first looks at the environment, observes the camera image s_t, and then performs an action a_t, which is a function of s_t; that function is traditionally written pi and called the policy. Eventually, at the end of all this, if we are successful, we will have generated some policy pi that selects optimal actions, and by optimal we mean that it maximizes the reward or success measure from the last slide. For the dog, it should fetch it the maximum food; for the manager, it should fetch the company the maximum profits; and so on. And we will do that based on seeing the observation s_t at time t. Is this notation clear now? OK.
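To make that interaction loop concrete, here is a minimal Python sketch. The `env` object and its `reset`/`step` interface are assumptions styled after common simulator APIs (the lecture doesn't prescribe any particular library), and, as discussed above, the reward may well be zero until the very last step.

```python
# Minimal sketch of the agent-environment loop. `env` and its interface are
# assumed here for illustration; nothing in the lecture fixes a specific API.

def rollout(policy, env, horizon=100):
    """Run the policy pi for up to `horizon` steps and sum the reward."""
    s_t = env.reset()                    # observe the initial state s_1
    total_reward = 0.0
    for t in range(horizon):
        a_t = policy(s_t)                # a_t = pi(s_t): the decision at time t
        s_t, r_t, done = env.step(a_t)   # the environment, not us, computes the next state
        total_reward += r_t              # feedback may only arrive at the very end
        if done:
            break
    return total_reward
```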
So the first algorithm class we'll see for this today is behavioral cloning, which is a form of imitation learning. One solution to sequential decision-making is to learn through supervised learning. It is not obvious how to do this, for all the reasons we just discussed, but there is a specific case where you have examples of somebody else performing good actions for you in advance. For example, maybe you're learning to drive a car, and you've observed a good driver driving the car for a long time. You can then say that, even without figuring out whether this maximizes the overall reward, which is what you originally set out to do, you can just mimic the driver you already saw. Does that make sense? So if somebody demonstrates for you how to do well on this task, you can just blindly try to mimic that person. That is really the idea of imitation learning. And in particular, we'll do this in the simplest way possible, behavioral cloning, which is surprisingly effective but also has its own problems.

OK, so I like to show this fun example of a baby imitating Rocky on the TV. Imitation is such a strong drive in animals and in humans, such a strong mechanism for learning, that the behavior is observable very early in a child's development: you can see it imitating anything it sees adults around it doing. This is also true for animals, by the way. And for those of you who've been around really young infants, it's sometimes surprising how much they take on the characteristics of the people they see around them. OK, so this is a really fun video. I'll let it play out; I actually can't see how much longer there is, but I will share these slides and perhaps you can take a look.

So let's go back to behavioral cloning for imitation. This is how it will typically look. You have an expert who has given you a bunch of information about how to perform a task; they show you how they perform it. For example, this is an expert driver whom you have the opportunity to observe. The expert driver makes a sequence of decisions based on observing states s_1 through s_H, and performs a sequence of actions a_1 through a_H. The way you should think about this is that first s_1 was observed, then a_1 was performed, then s_2 was observed, then a_2 was performed, et cetera, in the loop we saw before. So you have the opportunity to watch this expert at their job, they're very good at their job, you record all of what they're doing, and you treat that as your training data. Now, can somebody tell me how you would use a supervised learning approach to solve the problem of learning to drive? It's not a trick question; it's quite easy. If you want to learn a policy, remember, we wanted to learn a policy pi. You can call it pi_theta, because it has parameters theta that you want to learn. It takes as input a state s_t and produces as output an action a_t at any given time t. That is what you want to learn. Given that you've observed the expert at work and you have this data set s_1, a_1, s_2, a_2, and so on up to s_H, a_H, what do you do? Yes? You can do linear regression. Sure, you can do linear regression, or really whatever your favorite algorithm class is among the ones you've seen.
You've seen linear regression, and you've probably also seen neural networks and things like that. So you can use your favorite algorithm class for this and treat it as a supervised learning problem, at which point all you really have to do is write out an objective function: some loss function, curly L, of the output that your model produces given s_t as input, compared against the expert action a_t. Have you been denoting your losses as curly L? OK, let's stick with curly L then. So you want to recover, through the policy you're learning, the action that the expert would have performed in the state s_t, and you're going to sum this loss over all the expert data you recorded. Does that make sense? You're just going to treat this as a supervised regression or supervised classification problem; the actions might be discrete or continuous, it doesn't really matter. It's just the problem of trying to recover the same policy that the expert was using.

Yes. Yes, so in this setting, we are going to think of the expert as being in exactly the same environment and performing the exact same task. In some cases, for example the baby imitating the character on screen, the baby has a very different body and isn't even in the same environment, but still manages to do it. Those are actually harder problems within the imitation learning problem class: how do you mimic an expert who's driving a different car, or who has a very different kind of body? Maybe you're a robot trying to imitate a human, and as a robot you don't have all the same degrees of freedom as a human. How do you still learn to imitate in those settings? Those are harder problems, and we won't deal with them today, but that's a good question. Any other, did I see another hand up?

OK, so hopefully it's clear how, once you've collected this nice data set, this becomes effectively a supervised learning problem, and you just learn through supervised learning with something like this objective. And by the way, I've written pi_theta of s_t over here, but you can represent this in a couple of ways. What I have in mind is that you take s_t as input and pi_theta of s_t returns an output a_t. But you could also think of giving both s_t and a_t as input to pi_theta, which then tells you how good that pair is, which is maybe what this notation would suggest. Really, if you've seen probability notation, this is basically a probability density: you can think of it as pi_theta of a_t given s_t, a probability distribution over which actions you should perform given the current state. It might be, for example, that for a_t equal to 0 this value is 0.7, and for a_t equal to 1 it's 0.3; maybe you have two possible actions, and you output a probability distribution over those actions. OK, without getting too bogged down in notation, what's happening here is just supervised learning, supervised classification or supervised regression, on top of this expert data set.
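As a concrete, simplified sketch of treating the expert data set as a supervised classification problem: the network, data shapes, and hyperparameters below are all made up for illustration; the only point is that the loss is the negative log likelihood of the expert's action given the expert's state.

```python
import torch
import torch.nn as nn

# Hypothetical expert data: states (stand-in features) and the discrete
# actions the expert took in those states.
states = torch.randn(1000, 32)             # s_t
actions = torch.randint(0, 3, (1000,))     # a_t in {steer left, straight, steer right}

# pi_theta(a | s): a small network producing a distribution over actions.
policy = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 3))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
nll = nn.CrossEntropyLoss()                # negative log likelihood of the expert action

for epoch in range(10):
    logits = policy(states)                # unnormalized log-probabilities over actions
    loss = nll(logits, actions)            # -log pi_theta(a_t | s_t), averaged over the data
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```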
So you could imagine that this policy is now a convolutional neural network, for example, with the input state s_t being an image. You've seen convolutional neural networks before, right? You've already covered these. OK, so if the input is an image s_t, you would most likely want to use a convolutional neural network, and it might produce an output action to steer left or steer right. OK, so the behavioral cloning objective function, kind of like what I've written on the blackboard, might look something like this, where I've now written the loss explicitly as a log likelihood loss. You've seen log likelihood losses before, right? All we're trying to do is maximize the likelihood of the expert actions conditioned on the states the expert experienced, summed over all the expert demonstrations and the time steps within those demonstrations. So you could imagine that the expert didn't just give you one example of how to drive the car, but gave you multiple trajectories: maybe they drove down the same road many times, maybe they drove down different roads, and so on. Within each demonstration there are multiple time steps, and you're going to put all those time steps into a bag and sum the log likelihood over all of them, and that gives you an objective function. There's a negative sign here because you want to maximize the likelihood, so you minimize the negative log likelihood. OK, so this is basically just a supervised maximum likelihood objective: you want to train a function that maps from the expert's sensory observations to the expert's actions. So all of this should feel very comforting. Even though the problem of sequential decision-making seemed hard at the beginning, we've managed to reduce it, with the privilege of having access to expert data, to something that looks a lot like problems we've already encountered.

And this actually works rather well. There have been some quite surprising successes of behavioral cloning style approaches; these methods have worked now for a couple of decades. Here's an example from 2016 of NVIDIA producing an imitation-learning-based car, basically behavioral cloning with a couple of tricks here and there, that was able to learn to drive around fairly complex obstacles like these traffic cones. I think it also demonstrated some reasonable driving on traffic-less streets, and maybe with a little bit of traffic as well. How many times did they fail before this actually worked? We don't know, but it's an impressive demonstration nonetheless. OK, here is another example, in a simulated setting, of doing something like what the question earlier asked about: trying to imitate an agent that is not the same agent you're learning the policy for. If you're learning a policy for a humanoid robot like this, you might still be able to use demonstrations from people. Let me see if I can skip forward to the interesting parts of this; I think this is likely not going to work well over Zoom, so let me go back. Yeah, so you can look at this as an example of learning to perform an action having observed a human perform it.
All these skills were also learned by watching human videos in this case, so it was quite an impressive demonstration of the ability of a humanoid robot to learn from watching humans. OK, I can't quite get to the most interesting parts of that video right now, but I'll share the slides and you'll be able to look at it later. Good. So while it does work, there still remain several difficulties with behavioral cloning. One important thing that you should, yes. OK, question: can we use Gaussian processes to make this stochastic instead of deterministic, like in linear regression, so we get some uncertainty measure? Or is there another way to get a measure of uncertainty in decisions? Good question. So Gaussian processes are not the only way to get probabilistic outputs. In particular, if you've done logistic regression, which you almost certainly have, or if you've used softmax losses for training your neural network classifiers, those are already mechanisms for getting output uncertainties. If you have discrete action settings in particular, it's very easy to get output uncertainties. If you have continuous settings, you can still get output uncertainties fairly easily; you don't have to go to the extent of modeling a full Gaussian process. You can literally just output a mean and a standard deviation that, for example, parameterize a normal distribution. If you're happy to be restricted to outputting normal distributions, that will give you some uncertainty if you want it. And the objective function we wrote down is then inherently stochastic, because it literally says you want to maximize the likelihood: the objective is the log of this likelihood, which is a probability distribution that you're outputting. So it's quite natural to think of the policy as outputting a literal probability distribution. But if you wanted to learn a deterministic function, you could just plug in a mean squared error there as an approximation to the log likelihood. Does that answer your question, Spandana? I don't know how to interact on the chat, but I will. OK, good.
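Here is a small sketch of that mean-and-standard-deviation idea, assuming continuous actions; the module names and sizes are invented for illustration, and the loss is just the negative log likelihood of the expert action under the predicted normal distribution.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """pi_theta(a | s) as a normal distribution: the network outputs a mean
    and a standard deviation for each action dimension."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.mean_head = nn.Linear(64, action_dim)
        self.log_std_head = nn.Linear(64, action_dim)

    def forward(self, s):
        h = self.body(s)
        return self.mean_head(h), self.log_std_head(h).exp()

policy = GaussianPolicy(state_dim=32, action_dim=2)
s_batch = torch.randn(128, 32)   # stand-in expert states
a_batch = torch.randn(128, 2)    # stand-in expert (continuous) actions

mean, std = policy(s_batch)
dist = torch.distributions.Normal(mean, std)
loss = -dist.log_prob(a_batch).sum(dim=-1).mean()  # negative log likelihood of expert actions
# Minimize this with gradient descent exactly as before; with a fixed std it
# reduces (up to constants) to a mean squared error on the predicted mean.
```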
So let's get back to where we were. One important difficulty with this reduction of sequential decision-making to a supervised learning problem is that the policy is going to be trained on data that is different from the data it will encounter when it's actually run. The policy is trained on expert data, and that's not quite the same thing as the data you might encounter. One way to think about this: say the training trajectory looked like this black curve here. That's your training trajectory; maybe you're literally trying to navigate through a 2D environment, and this is the trajectory you're expected to take. So what will you do? You will train a policy that says, at this state you should perform this action, at that state you should perform that action, and so on, some particular actions that navigate like this. You'll learn this policy through imitation learning, through this supervised learning loss. But your policy might not be perfect.

Almost always, if you've ever trained any neural network at all, unless it's for a trivial problem, you're going to get some non-zero loss at the end. It might be minuscule, but it will be non-zero, and so the policy you've trained might not be perfect. So even if you start at exactly the correct location, you might deviate from the exact action the expert performed by just a little bit. You will perform just a little bit differently than the expert, and the problem is that this is not confined to that one time step. The environment tells you that if you perform a slightly different action, you end up in a slightly different state, and so things start compounding. Any little damage you incur from performing a slightly different action than the expert starts to compound. By the time you've performed a few actions, you are deviating more and more from the expert's original trajectory, and you find yourself in states that are different from the states you trained on. All of a sudden you're in states you've never seen before, and that just makes matters worse. So you have this problem of compounding errors: at the beginning you might be doing a really good job, but you'll still be off by just a little bit, because it's very difficult to get it exactly right, and that little bit eventually compounds to put you in states that are very different from the states you encountered during training. Once you're in states you never encountered during training, that's really bad.

So, have you encountered, so far in this course, cases where you've looked at how well models generalize to data that is outside of the training distribution? There are two types of generalization, and that's maybe one of the meta points about supervised learning in today's lecture. One type is generalizing to other samples from the same distribution you've seen. You might say, for example, that my training distribution is all sampled from this region of the input space, and I've sampled these points. Very often when we talk about generalization, we mean whether the model trained on these points will generalize to other points sampled from the same region. That's what you would typically call in-distribution generalization, and it's what we always want our models to do. We fight models that overfit the data, and by overfitting we mean they produce the correct responses at the training samples but start failing even on samples that were not in the training data yet remain inside the training distribution. That's what we normally mean by generalization, and I'm going to call it in-distribution generalization today to distinguish it from the problem of generalizing completely out of distribution. Because you might also have the problem of generalizing to samples that are over here, outside the training distribution, which is a much, much harder problem, as you might imagine.
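Here's a toy one-dimensional sketch of that compounding-errors story; the dynamics, the error size, and the reference trajectory are all invented purely for illustration.

```python
import numpy as np

# Toy 1-D illustration: the state integrates the actions (s_{t+1} = s_t + a_t).
# The expert drives the state along a reference path; the cloned policy copies
# the expert's action up to a small per-step error and never learned how to
# correct itself once off the path, so the little errors accumulate.
rng = np.random.default_rng(0)
horizon, eps = 200, 0.05
reference = np.sin(np.linspace(0, 4 * np.pi, horizon + 1))   # the expert's trajectory

s_clone = reference[0]
gap = []
for t in range(horizon):
    expert_a = reference[t + 1] - reference[t]   # action the expert would take on the path
    clone_a = expert_a + rng.normal(scale=eps)   # imitation is slightly imperfect
    s_clone = s_clone + clone_a                  # the environment carries the error forward
    gap.append(abs(s_clone - reference[t + 1]))  # distance from the states seen in training

print(f"deviation after 1 step: {gap[0]:.3f}, after {horizon} steps: {gap[-1]:.3f}")
```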
So you can't just fight this by doing regularization, for example, and getting in-distribution generalization to happen. Once you're in this regime where you're outside of the training data, outside the support of the training distribution, you're really operating blind, and that makes things very difficult. We'll talk about out-of-distribution generalization a fair bit today. So that's the problem: out-of-distribution generalization is hard, and you start this vicious cycle of compounding errors. Any time you make a slight deviation from the exact expert action, you end up in states that are out of distribution, which just makes matters worse, and eventually things blow up and you end up in a very different part of the state space than where you wanted to be. Good. So the cloned policy is imperfect, that leads to accumulating errors, and then you encounter unfamiliar states and eventually fail.

Another important issue with BC, and by the way behavioral cloning is often just called BC, is that the objective function you trained on is not quite the same as the problem you set out to solve. The objective you trained on is a maximum likelihood objective over the expert demonstrations, but that's not the same thing as maximizing the task reward. We initially said we wanted to maximize the food; the BC objective might be somewhat aligned with that, but it's not perfectly aligned. The reason that matters is this: it might very well be the case that if you got to the exact global minimum of this objective, if you imitated exactly, that would also give you a policy that maximizes the reward. In other words, if you got your loss down to zero on this objective, that very likely also maximizes the reward. But what the objective doesn't tell you is, if your loss is non-zero, which in all likelihood it will be, how to distribute those errors well. Does that make sense? If you are going to make some amount of error, this objective function says that evenly distributing those errors over the entire data set is exactly as good as making really large errors on some samples and really small errors on others. It doesn't express any preference about where to make mistakes. So even if the zero-error solution of this objective is the same as the zero-error solution of maximizing your reward, the non-zero-error solutions matter, and that's important in optimization, because you need useful signals while your policy is not yet perfect, which is exactly when you're trying to optimize it. But this loss function tells you nothing about how to distribute those errors. Does that make sense? So effectively, BC explicitly treats errors at all time steps as equivalent, but in reality, some time instants matter more than others. If you're performing a sequence of decisions, some of those decisions are going to be much more critical to solving the task than others.
Does that make sense? Okay, so that's going to be an important observation for the two things we might cover today, given time; maybe we'll only cover one of them, and that's fine. All right, so the first thing I'll talk about is this interesting recent project where we focus on the right time instants for imitation. To set that up, let me back up to a problem that has been noticed in imitation learning for some time now. Say you're trying to learn to drive your car. The way I've set up this problem so far, I've talked about it as a mapping from the current observation to the current action you must perform. But in reality, sometimes you don't want the input to the policy to be only the current observation; you might want it to also include previous observations. The reason is that the current observation may not tell you everything you need to know about the scene. For example, as you're driving down the road, a pedestrian might walk behind a car parked on the side. They're not immediately visible to you, but they're something you should account for: you saw the pedestrian just a couple of time steps before, so you should keep that in memory and act on the basis of it, and be careful driving down this road to avoid hitting them, even though the pedestrian is not instantaneously visible. So you want previous observations to be input to your policy. Is that motivation clear? Okay.

So historical observations should be useful, but unfortunately, it's been noticed for some time that they don't always help. And the issue, in particular, is really that out-of-distribution generalization issue. So here's what happens. This perplexity, by the way, have you encountered the term perplexity? Perplexity is basically just another name for negative log likelihood, so you want it to be low, and the bolded entries here correspond to good things. On the validation data, which is held-out data from the expert demonstrations, just like the validation set you'd use in supervised learning, would you think of that as in-distribution or out-of-distribution generalization? That's in-distribution generalization, good. And for in-distribution generalization, when you do have a history as input, you do well. However, when you actually start driving down the road, we've already seen that once you start executing a behavioral-cloned policy in the world, you start deviating from the training distribution and encountering states that are different from the training data. So all of these other columns, which measure things like the distance you manage to travel, the number of times a human had to intervene to avoid a crash, or the number of collisions, are really properties of the out-of-distribution generalization ability of the model you learn.
That's because you're operating in the environment most of the time on states you've never seen before: even minor imperfections in your cloned policy lead you to new states. And you can see that, indeed, the in-distribution performance agrees with our intuition that using history as input should help, but the out-of-distribution performance is letting you down. So this is an interesting problem that's been observed for a couple of decades in imitation learning, and we recently had a really simple insight about it, which I think is also suitable for drawing some broader take-home messages about imitation learning, or really about supervised learning. So, why does BC from observation histories sometimes perform poorly? That's the question we set out to answer. And the answer turned out to lie in a phenomenon we termed copycat shortcuts. What's happening is that the agent, when trained with a sequence of past observations as input, rather than paying attention to the environment to decide which action it should perform, learns to instead deduce the previous action and copy it, rather than predicting the next action it should perform.

Here is an example of why this might happen. In expert data, good driving almost always looks like actions that are smooth over time. For example, you might have an expert demonstration where they were braking at a red traffic light for, let's say, a hundred time instants in a row; I've only shown two here. Then the traffic light turns green, the expert starts to press the throttle, and they keep pressing the throttle continuously for the next hundred time steps. That's an example of a good expert demonstration. Now, a copycat policy would get away, on this kind of data, almost all of the time by just copying the previous action. For example, at this instant, given that the expert was also braking before this time, the copycat policy could say: I'm just going to imitate what the expert was doing at the previous time instant, because that is available to me as part of my input and it's very easy to learn. Similarly, over here I'll mimic this braking action, and over here this braking action, and over here this throttle. So I'm just going to take the input that's coming to me, which includes s_{t-1} and s_{t-2}, where s_{t-1} also includes the information that I was braking at the previous time instant, and learn a trivial mapping from that to my current action. So say s_{t-1} also includes my previous action a_{t-1}. In our experiments it doesn't have to include it explicitly; the information is implicitly present anyway, and the agent actually learns to deduce it. But for the moment, imagine that you explicitly provide a_{t-1} as input. Then you can just learn a trivial mapping, an identity mapping, from a_{t-1} to a_t.
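A quick toy check of that cartoon, assuming the two phases are a hundred steps each as described: just repeating the previous action is almost always right on this kind of data.

```python
import numpy as np

# The cartoon expert: brake for 100 steps at the red light, then throttle for
# 100 steps once it turns green (0 = brake, 1 = throttle).
actions = np.array([0] * 100 + [1] * 100)

# The "copycat" predictor: just repeat the previous action.
predictions = actions[:-1]    # prediction for step t is the action at t-1
targets = actions[1:]         # the action the expert actually took at step t

accuracy = (predictions == targets).mean()
print(accuracy)   # 198/199, roughly 0.995 -- wrong only at the single change point
```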
And that actually does a reasonably good job on this training set, right? Yes. Yeah, exactly. So that's exactly what we would like it to learn, and it would be a much more complicated function to learn: it would have to figure out where in the image there's a green light, figure out that it corresponds to a green traffic light, and in response make an action that involves pressing the throttle. But instead it learns a very trivial policy that just copies the previous action, which is available to it as part of the input, and it does almost as well. That's right. That's right. Exactly.

So, to back up a little bit: if you trained this policy without s_{t-1} and s_{t-2} as inputs, with s_t alone, you wouldn't see this behavior emerging, because there is no way for the agent to learn to imitate its own previous action in that case. But on the previous slide I gave you this example of a pedestrian getting hidden behind a car; that's an example of a setting where you do want access to information from previous time instants. And when you try to solve those types of problems by introducing previous time instants as input to the policy, it turns out there's this pathology that emerges in your deep learning solution: it starts to pay attention to the wrong things. Instead of paying attention to the right thing, which was the green light that was indeed present in s_t, it says: actually, you know what, I don't need to do any of that work at all; I can do the lazy thing and literally just imitate what I was doing at t minus one, because that is almost always going to be right. You do provide s_t as input, but the solution that's learned ignores s_t. That's exactly the problem, right? Yes.

Yeah, good question, and through the experiments you'll see that. At this point I just have the schematic for you, but in the experiments you'll get to see how we actually evaluate that this is true. This has also been called by different names before, like the inertia problem; it's a problem that's been noticed in the literature for some time, but you'll see in the experiments that we actually verify it. Okay, so this is the insight: what's causing this agent to generalize poorly out of distribution is that it's learning a trivial solution, one that just repeats the previous action. That trivial solution, it turns out, is sufficient to get in-distribution generalization: just copying the previous action works pretty well on held-out data from the expert itself. But when the policy starts encountering its own new states in the environment, which is the other type of generalization, it doesn't work well anymore. Okay, I saw a bunch of questions, I saw yours first, yes. Do you mean that the current expert action at s_t is to brake? Well, I should mention that the s's are not actions; they are states.
s_{t-1} is also only the previous state, but by looking at s_{t-1} and s_t you can deduce, for example, that your car is still, and therefore that you've been braking between s_{t-1} and s_t. So the information about a_{t-1} is available in the sequence of the past few states. That's something I've brushed under the rug a little bit: it's not that we explicitly provide a_{t-1} as input to the policy, but that information is available anyway in s_{t-2}, s_{t-1}, s_t. Once you provide a sequence of three states, or K states, as input, the information about past actions is quite easy to deduce; you can easily deduce that, okay, I've been steering left, or I've been pressing the brake, and so on. I saw a couple of questions there, yes.

Yeah, so why are we getting better perplexity on the validation data with history if the other metrics are, sorry, say that again? What was the question again? No history was actually better on those other metrics, but the reason we're getting better perplexity with history is that validation perplexity only measures generalization in distribution. If you're always able to see the expert action at the previous time step and just repeat it, that's what this figure is trying to motivate: if you just repeat the last expert action, you're almost always going to do well. In this cartoon figure, you're doing well three out of four times, but you can imagine that each of these phases, where you're braking and where you're throttling, is actually a hundred time steps long, in which case you're doing exactly the right thing for 199 out of 200 time steps. There's only one time step where you actually make a mistake in terms of validation perplexity, which is the one instant where things change. Does that make sense? Yeah, the validation perplexity is not a good measure; it's really the same thing as the training loss, just measured on the validation data. So it connects back to what we pointed out earlier about behavioral cloning not really measuring the right thing: it's not giving you the right signals during policy learning. Yes. This example here is talking about validation perplexity, which is about in-distribution generalization. That's right. And I'm claiming that this is actually good performance here; this is 75% right on this toy data set, and really it's intended to show you the path towards it being even better: you can get, like, 99.9% performance if it happens that almost all of the time your expert is just repeating their previous action anyway, and only at a few sparse instants in your driving trajectories does the expert have to do something dramatically different from the previous instant. Right? Okay, good questions.

Great. So this is now the problem we're dealing with. And is it clear already why this type of policy would not generalize once you start acting outside the training distribution? What will happen, and what people have actually observed, is that you start rolling out this policy in your environment: at the very first instant, the car is stationary, and at the next instant, the car just decides to continue to be stationary.
It doesn't even get off the ground. And people have actually invented weird hacks to get around this: they will literally force the car to press the throttle for the first few instants and then start executing the imitation-learned policy, and so on. But of course that doesn't work well; if you come to a stop again at a red light, you're never going to take off again. There are all these really weird issues that happen once you start imitating with previous states as part of your input. So that's the thing we're trying to fix. It's disastrous when you roll it out; by the way, rolling out is the term for executing the policy in the environment. Yeah. Basically, why did you see that, say, for four years? Not quite. So we aren't really doing anything particularly smart here at all. All we're saying is that the previous state, literally the previous instant, not something that happened somewhere else in your training data, is available to you as part of the input to your policy, so it's trivial to learn a mapping from input to output that just consists of mimicking the previous action.

Yeah, so that's a good point. It might strike you as very specific to the problem of driving cars, because driving does typically involve smooth actions. But it turns out that a large number of actual control problems in robotics require you to perform smooth actions, and this always tends to be a problem. So we don't just evaluate this in the driving setting; we evaluate it on a couple of other robotic control settings as well. The problem of moving my legs to perform a walking motion is a fairly smooth process in itself. And even though I'm setting this up with discrete actions, where you either brake or you throttle, the problem also arises with continuous actions, where you're moving forward at a velocity of, let's say, 1 meter per second at time instant 0.001, and at time instant 0.002 your velocity is instead 1.001 meters per second, and so on. You can do a really good job, as you might imagine, by just copying what you were doing at the previous instant. Does that make sense? OK, great.

So what is causing the learner to learn shortcuts? We've already kind of answered this question in some way, but let's look at an abstraction of it, because I think there's an interesting broader lesson here about out-of-distribution generalization. Oops, this is not what I intended to show first, but that's fine. This phenomenon of shortcuts is not something that we coined; the term has been in the literature for a couple of years now. Essentially, shortcuts are solutions that are somehow trivial; they do well on in-distribution data, on generalizing to in-distribution test data, but they fail when you ask them to generalize to out-of-distribution data. That should be very reminiscent of exactly the simple copycat shortcut we saw on the previous slide. So let's see a toy problem that demonstrates this. Your training set might have two categories; you're performing, let's say, a categorization task, with A and B as the two category labels. A images look like this, and B images look like this.
And let's say that you now test for in-distribution generalization with this test set. I think you and I would all agree on the labels for these images versus these images: you would call these A and these B. Is that fair? And in fact, that is also what a neural network trained on data like this would predict; it would get the answer right, predicting that the labels for these are A and for these are B. So both the neural network and we have apparently concluded, correctly, that star shapes correspond to category A and moon shapes correspond to category B. That's our semantic understanding of what's been learned. Now things get complicated when you present this test set, which is out of distribution, because you've never seen a star in the top left corner, for example, or a star in the bottom right corner, or a moon in the bottom left or top right corner. These are out of distribution. And here, this is a schematic, not a real experiment, but it gives you the right intuition: you and I would say the stars are A's and the moons are B's, but a neural network might very well instead learn that the stars are B's and the moons are A's. Because perhaps all the neural network really learned from your training data is that anything in the top right or bottom left is an A, and anything in the bottom right or top left is a B. By that understanding of what the data is telling it to do, the network might very well call this a B and this an A instead. So there are spurious correlations that emerge during training. They do generalize if the test data is from the same distribution, which is the case in this middle row. But once the data is from a different distribution, it really starts to matter whether the concept that was learned was this concept of distinguishing the shapes, or this concept of distinguishing the locations where the shapes appeared.

I saw a hand up there. That is an interesting question: can this problem be solved using a convolutional network, since a convolution has the bias that it operates the same way regardless of where you are in the image? Well, a convolution does not get rid of the information that the star appeared in the bottom left versus the top right; it still retains that information, so it doesn't throw away the fact that your training inputs always contained stars in particular corners, and therefore it still allows your network to potentially learn a solution like this wrong one. To make that clearer: the answer to your question would have been yes if you had somehow built a mechanism that was literally invariant, one that did not even retain the location associations. People sometimes call CNNs invariant and sometimes call them equivariant, but the correct term really is equivariant, in the sense that CNNs retain the location associations. The output of a CNN feature map contains all the location associations of the input: if you shifted the star to the right by 10 pixels, the CNN's feature map would also shift to the right by 10 pixels. That's really what we mean when we say CNNs do equivariant processing. The outputs of CNNs still retain those locations, and therefore it's still possible for you to learn the wrong solution, this wrong solution over here. Does that answer your question? OK.
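As a tiny check of that equivariance point (illustrative only; the image size, kernel, and shift are arbitrary): convolving a shifted image gives a correspondingly shifted feature map, so the location information is retained rather than discarded.

```python
import torch
import torch.nn as nn

# Translation equivariance: shifting the input shifts the conv feature map
# by the same amount (away from the image borders).
conv = nn.Conv2d(1, 4, kernel_size=3, padding=1, bias=False)

img = torch.zeros(1, 1, 16, 16)
img[0, 0, 3, 3] = 1.0                                     # a "star" in the top-left region
shifted = torch.roll(img, shifts=(0, 10), dims=(2, 3))    # the same star, moved 10 pixels right

out = conv(img)
out_shifted = conv(shifted)
# The feature map of the shifted image equals the shifted feature map.
print(torch.allclose(torch.roll(out, shifts=(0, 10), dims=(2, 3)), out_shifted, atol=1e-6))
```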
IID? Well, IID stands for independent and identically distributed; I think you've probably seen it at some point during this course. It just means you have some training distribution from which you sample the data, and you sample each data point independently. Identically distributed means all the data points are sampled from the same distribution; independent means you sample every data point independently of all the other data points you've sampled. If you're familiar with probability notation, all your data points (x_i, y_i) are sampled from some distribution p(x, y). That's really all it means. OK, good. Was there another hand that I haven't gotten to? We're a little behind. Yes. All right. Oh, I see there are actually six chat messages; sorry, I'm really bad at monitoring this stuff. Can we say the perplexity needs to be small? Yes. OK. Can you explain the table again? I'm confused between history and no history. So history out of distribution will result in high perplexity, which is bad, but here, because it is in distribution, it's performing better than no history. That's roughly right, Spandana. But we don't really have labels for the out-of-distribution data, so we don't measure something like perplexity in the table on the previous slide; there's no actual out-of-distribution perplexity measure. We instead just measure driving performance, in terms of the distance the car drove, the number of times you had to intervene, or the number of times it collided with an object on the road. That's a proxy for out-of-distribution perplexity, because the only way you can measure perplexity is if you happen to have labels, and the data your car encounters when you roll out your policy doesn't come with labels; you don't know what the optimal action to perform there is. So you can't measure an out-of-distribution perplexity per se, but you can measure driving performance, which is a pretty good proxy, and really, driving performance is what we care about in the end. We don't care about matching an expert; we care about doing well on the driving task. Hopefully that answers your question. OK, good.

So now we've seen shortcut learning. And by the way, shortcuts happen all the time in deep learning; it very often happens that your networks learn the wrong thing. A classic example: you might have had cows in your training data set for classifying animal images, but only ever cows against a lush green background. And then all of a sudden, at test time, you're presented with a cow against a beach background.
And this is something that you've never encountered before. Your network might have done very well on cows on lush green backgrounds, but it might have done well by just learning that lush green backgrounds always correspond to cows, rather than actually looking at the shape of the cow. And then when you put a cow on a beach, it might get confused and call it a whale or something instead. In fact, this is literally what happened; it was reported in a result a couple of years ago. And this paper at the bottom, by the way — I should have remembered to include the actual name of the paper; I think it's called "Shortcut Learning in Deep Neural Networks" — is a pretty light read and gives you a nice overview. It compiles these results and provides a nice explanation of the various problems that have been reported across different disciplines: computer vision, natural language processing, and so on.

So having seen this brief aside on shortcut learning and how it manifests as out-of-distribution generalization issues, let's go back to our copycat shortcut in the behavior cloning problem. Why does the copycat problem arise? Well, here is our typical objective function for training the policy. It looks like that log likelihood objective we wrote earlier: a summation over all your training data of the difference between the output of your policy and the expert's action. All your training data is actions from the expert annotating the observations that the expert made. o_t, by the way, is just the same thing as s_t here; it's just a different notation. So all that you're doing is solving the supervised learning problem, trying to regress to the expert actions given the expert states as input. And as we said, there's only a very small fraction of these samples that you get wrong if you do the wrong thing — meaning if you just imitate the expert actions from the previous time step, if you just copy them. That shortcut is actually really good at this MSE minimization problem; it's almost perfect on it. And the reason this happens is that expert actions are highly temporally correlated: the demonstration has a really tiny fraction of samples where something dramatically different happens between consecutive time steps. And therefore, you don't really get penalized for doing this really stupid thing of copying what you did in the previous time step.

OK, so one way to think about this, and this is how we'll motivate our solution — the solution is really simple — is that you have this highly imbalanced data set. These change point samples, the points at which something in your environment changes dramatically, like your traffic light changing from red to green, or a pedestrian who was on the sidewalk suddenly coming onto the road, demand very, very different actions from your policy. When something important changes, that's when you actually need to respond. Those time steps are more important than the other time steps where nothing new has happened in the environment. And so it's important for you to try and fit those change points well.
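To pin down the objective being described here — this is my notation for what the slide shows verbally; the exact form (sum versus mean, choice of norm) is an assumption on my part — the behavior cloning loss is just a squared error summed over the expert data:

$$\mathcal{L}_{\mathrm{BC}}(\theta) \;=\; \sum_{t} \big\lVert \pi_\theta(o_t) - a_t \big\rVert^2,$$

where $o_t$ is the expert's observation (the same thing as $s_t$) and $a_t$ is the expert's action. The copycat shortcut is a $\pi_\theta$ that effectively outputs something close to $a_{t-1}$ while ignoring the current observation, and because consecutive expert actions are so similar, it incurs almost no loss anywhere except at the change points.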
But unfortunately, your data set contains a very small fraction of those change points, which is really what's causing this problem. Is everybody with me? That's really what's causing the copycat policy. Can somebody suggest what you would do in a setting like this, where (a) your data set is imbalanced, and (b) the underrepresented class, the underrepresented portion of the data within your data set, is actually the stuff that you really want to get right? What should you do here?

Yes, you could upsample the underrepresented portions of your data set where you do want to do well. You could say, I'm going to make 10 copies of them instead of just having one copy. And maybe now, instead of the ratio being 99 to 1, it becomes 99 to 10, which is a significant improvement. Yes, very good intuition as well. So somehow you are going to use information about whether the expert is dramatically changing their actions or not, and that's going to help you determine what you need to do to solve this problem.

Yes — how do you assign more weight to the current state, though? So with methods like the ones that you're covering in this class, like deep neural networks, it's not the case that you can just multiply a weight with one of the input dimensions and use that to indicate the importance of that input dimension. Are you talking about the objective function, or — yes, so in the objective function you're going to assign different weights? Is that right? That's good, too. That's a great idea as well, and it's the more versatile version of the earlier idea about repeating a sample multiple times in your training data, which does something similar to what you're saying. OK, did I see another hand up somewhere?

All right, so that's actually what we'll do on the next slide: we'll literally take this objective function and, rather than communicating to the policy what this objective function currently says — which is that every sample (o_t, a_t) in the expert data is equally important — we're going to assign different weights to different samples. And in particular, we'll try to assign weights in a way that emphasizes the points where you really want to do well, which are these change points, which are also the points where you're currently doing really poorly.

All right, so towards that solution, we need some way of determining which points to weight heavily. And that means we need to determine these change points in the expert action stream somehow: whenever the expert is doing things that are different, things that cannot be predicted purely based on past actions. By the way, this is a good point for me to make this story a little more complex than what I've been saying so far. So far, I've been saying that the problem is purely that you literally do the same thing the expert was doing at the previous time steps. That's actually just the cartoon picture of the story. The real story is that the shortcut doesn't just literally imitate what the expert was doing at the previous time step; it can also predict forward from what the expert was doing at the previous time steps.
If the expert has been slowly increasing the throttle — let's say the throttle is a continuous measure — if the expert has been slowly increasing the throttle for the last few time steps like this, and you're currently at this time step, it's quite easy for you to extrapolate this forward and say, that's what I want to reach, that's what I want to do. And you can do this, again, purely as a function of the previous actions, ignoring all of the important information in the current state. And that's a problem again. So it's not purely a problem that you copy exactly what the expert was doing at the previous time step, which would correspond to doing something like this; you could also learn more complex functions of the previous time steps, like saying that my next few actions should look like this. Does that make sense? And you want to avoid that problem as well.

Okay, how much time have we got? We have about five minutes, or what is it? 15 minutes? Okay, good. All right, I think that's a good amount of time for us to wrap this up. So what we're going to do is construct an optimal copycat policy. Towards eventually constructing an optimal driving policy, the first thing we're going to do is construct the optimal copycat policy. And what we mean by that is a policy that literally does not have all of the important information available to it; it only has the previous actions available as input. So you're literally going to construct the optimal version of a policy that only depends on the past actions. It might learn this behavior, it might learn this behavior, and so on and so forth, right? So in particular, you're going to learn a policy which, rather than having the last few states as input, literally has the last few actions as input. It doesn't have any information about the state. It literally cannot see if the traffic light is turning green. It literally cannot see whether the pedestrian is stepping out onto the road or not. It only has access to the past few actions. So it's going to do the worst possible job in terms of the problem we've actually been trying to solve; it's going to learn the optimal way to condition its actions only on the previous expert actions. So it will literally just learn to extrapolate the expert actions. And that's what we'll call the optimal copycat policy.

And this is going to be helpful because we can then look at where this optimal copycat policy fails on your training data, and that'll give you a signal about which points in the training data are change points, right? Whenever you fail to produce the expert actions by merely extrapolating the previous actions, that's a strong signal that there is something important happening over there. That's the intuition that we're going with, right? So again, copycats are correct most of the time on the training data, like we've been discussing. So if you have a sequence of observations that looks like this, where the traffic light changes at t equals four from red to green, this kind of copycat is going to be right everywhere except at this one instant when the traffic light changes. And it might have learned, for example, to literally mimic the previous step, or maybe it conditions on the past couple of steps; whatever it does, it's only able to access the previous actions as its input.
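Here's a minimal sketch of what training such a copycat policy could look like — my own illustration, assuming PyTorch, a toy two-dimensional action (say steering and throttle), and an arbitrary small MLP; the actual model in the work may differ:

```python
# Train the deliberate "copycat": predict the expert action from only the
# last k expert actions, with no access to the state. (Illustrative sketch.)
import torch
import torch.nn as nn

k, action_dim = 3, 2                      # assumed history length and action size

copycat = nn.Sequential(
    nn.Linear(k * action_dim, 64), nn.ReLU(),
    nn.Linear(64, action_dim),
)
opt = torch.optim.Adam(copycat.parameters(), lr=1e-3)

def train_step(past_actions, expert_action):
    # past_actions: (batch, k * action_dim); expert_action: (batch, action_dim)
    pred = copycat(past_actions)
    loss = ((pred - expert_action) ** 2).mean()   # same MSE objective as BC
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Because the input is just a handful of numbers rather than images, a model like this trains very quickly compared to the real policy.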
It learns some kind of extrapolating policy, right? And the only points in time that actually break this copycat shortcut — which we've now deliberately trained; this is no longer just an annoying phenomenon that happened to us, we've deliberately reproduced it in this copycat shortcut policy — those points are going to help us identify where the change points lie.

Yes. Yeah, so actually what we'll end up doing is going to look a lot like AdaBoost; it's going to look a lot like boosting, basically. We will literally take the errors of this model and say those are the places where we want to do better. But yes, we are going to upweight the change points so as to avoid the pathological solution of literally mimicking the previous actions, or only extrapolating the previous expert actions. We're going to make that solution less appealing in the optimization process by making the error corresponding to it higher. Yeah, L1 regularization — I'm not sure why L1 regularization should fix this. Sure. No, so methods like L1 regularization are primarily useful for generalizing in distribution, right? Yeah, you do want to weight those errors as well. You're not going to literally drop those errors; you're just going to give them smaller weights, so as to avoid learning this pathological solution that does well on those points anyway. Does that answer your question, or am I missing something? Maybe come to me after the talk; I feel like I'm not quite getting what you're asking. All right, good.

So again, the difference between the copycat policy and the reward-optimal policy is that both of them get low training and validation losses, but the copycat policy gets low environment rewards, because that demands out-of-distribution generalization, whereas the reward-optimal policy, the one that actually learns to drive well, gets high environment rewards just by definition. And one way to think about why it might sometimes be the case that the copycat policy produces as low or even lower training error than the reward-optimal policy is that the copycat policy has to learn something very trivial: its only source of error is change points in the expert data. The reward-optimal policy is learning a much more complicated mapping from the input states to the expert actions, and there are many, many potential sources of error there. You might have optimization noise, or you might have limitations of your model — maybe your model is literally not capable of representing the true mapping from expert states to expert actions — or you might have some sources of stochasticity. All of these might cause errors for the more complicated model, which also happens to be the correct model in this case. So that's one reason why, despite the fact that there are these change points where the copycat does incur training error, that error may actually be lower than the error you get from doing the right thing on the training data, because doing the right thing is actually harder on this type of training data. Okay, so that's why we want to make this trivial copycat solution, the shortcut policy, a less attractive solution than the reward-optimal policy. And by making it a less attractive solution, we hope that the optimization will then actually learn the correct reward-optimal policy, right?
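Just to keep the two quantities being contrasted distinct — again, this is my notation, not the slides' — the training loss and the thing we actually care about are different objects:

$$\mathcal{L}_{\mathrm{BC}}(\theta)=\sum_t \big\lVert \pi_\theta(o_t)-a_t\big\rVert^2 \qquad\text{versus}\qquad J(\pi_\theta)=\mathbb{E}\Big[\sum_t r_t\Big],$$

and the copycat does well on the first while doing badly on the second, because evaluating $J$ means rolling the policy out in the environment, which is exactly where it goes out of distribution.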
And how do we make this a less attractive solution? Well, we are going to upweight the change points, all right? Remember, the change points are where the copycat policy fails in the first place. So we are going to upweight the change points, just like somebody in the audience suggested earlier. We're going to take these points that originally looked like this, and instead weight them. So we take that point where there was a change point, increase its weight, and that manifests in the objective function like this: instead of the original objective function that looked like that, we'll have this w_t, where higher weights are assigned to the change points in the expert demonstration, all right? Really simple solution. And now the only question is, how do we identify and upweight those change points?

The solution is literally to take that optimal copycat policy that we trained and look at where it makes errors, right? The mean squared errors of the copycat policy on the training data can literally be used as the weighting function. So on every sample corresponding to time t, you can measure how far the output of the optimal copycat policy is from the expert action. And you can say, hey, I've identified a point at which the optimal copycat policy cannot recover the expert action correctly; that should be a time step that I pay more attention to, right? So here's an example of this, again in our toy example of stopping at a traffic light and then starting. You can see that the action prediction error, which is this guy here, the error between the copycat policy and the expert action, the action prediction error in red over here, is highest when you first brake at the red traffic light, and when you start accelerating as the red traffic light turns green, right? So these are the two points at which there is a significant action prediction error. And that's exactly what we would like: when things change in the environment, when your expert actions are no longer predictable purely from previous expert actions, those are the things you should start paying attention to.

So our final objective function just looks like this: we literally say that higher action prediction errors should correspond to higher weights. You can design any monotonic non-decreasing function for this. I believe we used a kind of sigmoidal function eventually, so basically a function that evolves like this, where as the action prediction error increases, the weight goes up, but it doesn't go up in an unbounded way; it stays between two limits, right? Okay, any questions about this?

Okay, so really, the only two things we've done to fix this: we first deliberately trained a copycat shortcut policy that only had access to expert actions. That's actually really, really simple to do, because it's a very small model — all that you're providing as input is the previous actions of the expert, which are very low dimensional compared to something like the images you might train on. So you can do this really quickly. And then you can use its errors on the training data to, like somebody suggested, do basically what you would do in boosting.
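As a concrete sketch of that recipe — my own illustration; the exact sigmoidal weighting used in the actual work is an assumption here, with made-up constants — you freeze the copycat, turn its per-sample errors into bounded weights, and drop those weights into the behavior cloning loss:

```python
# Turn the frozen copycat's action prediction errors into per-sample weights
# and use them in a weighted behavior cloning loss. (Illustrative sketch.)
import torch

def change_point_weights(copycat_pred, expert_action, alpha=5.0, w_min=1.0, w_max=10.0):
    # Action prediction error of the frozen copycat on each sample.
    ape = ((copycat_pred - expert_action) ** 2).mean(dim=-1)
    # Any monotonic non-decreasing map works; here a sigmoid squashed between
    # two limits, loosely following the description in the talk.
    w = w_min + (w_max - w_min) * torch.sigmoid(alpha * (ape - ape.mean()))
    return w.detach()                      # no gradients flow into the copycat

def weighted_bc_loss(policy_pred, expert_action, weights):
    per_sample = ((policy_pred - expert_action) ** 2).mean(dim=-1)
    return (weights * per_sample).mean()   # change points now count for more
```

The weights are computed once from the copycat's errors on the training data, so the only change to the main policy's training loop is multiplying the per-sample losses by them.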
You've already seen boosting in this class — or has only a subset of you seen it? Okay, good. So if you've seen boosting, then you can use the errors of this model on the training data to upweight samples for training the eventual target BC policy. And that's all you do. And unlike boosting, you don't have to repeat this multiple times; you just do it once, all right? Because you really wouldn't want to train big, deep neural networks multiple times, sequentially one after the other, like you do in boosting. So it's really not that expensive at all. It's quite easy.

Okay, so one thing here that kind of answers the question you had earlier is: how do we know that this copycat problem is actually what's happening, what's causing our poor performance, right? To see that, one thing you can look at is the errors of the optimal copycat policy that you've trained. Remember, those are what we're calling the action prediction errors. Those action prediction errors are actually very predictive of the baseline's errors. So if you train the baseline, which over here we're calling BCOH, for behavior cloning from observation histories, its errors along a particular trajectory look like the blue line, and the action prediction error looks like this green line. And you can see that the green line peaks at several of the same spots where the blue line peaks. So the errors of the baseline BCOH policy quite closely track the errors of this trivial copycat policy that we deliberately trained, right? And that gives you some clue that the pathology happening in BCOH is actually quite similar to this deliberately created, weird copycat policy that literally has no access to the environment state. Does that answer your question from before? Okay. Do you have something that I can address now, or? Okay, all right. Good.

And you can actually lower those errors. You can lower those error tendencies by upweighting those points in the data, as you might imagine, and that leads to much better performance eventually. So you can get the performance up from — BCOH here is the baseline, BCSO is another baseline, and here are a couple of other methods that have been designed in the past few years for tackling this problem, typically much more complex approaches that involve reasoning about causality and so on. It turns out that just literally identifying the change points and upweighting them can actually fix this problem. And this is true not only on the CARLA driving environment, but also on these other robotics-inspired control tasks, which are also sequential decision-making problems.

Okay, given that we have only about five minutes left, let me show you some other examples of what happens when you do baseline behavioral cloning. Like I said, it doesn't really learn to press the throttle or to brake or to do anything dramatically different very well. And so it often leads to collisions like this, where you literally cannot stop in response to a car in front of you stopping; you just run into it, whereas you do more sensible things when you learn a better policy, obviously. Here's another example where, again, the car in front of you is stopped and you just run right into it.
Here's another example of that kind of ridiculous situation where you stop at a red traffic light or something like that over here. Yeah, there's a red traffic light somewhere in the distance; you stop, and you just never get started again. You just stay there forever, basically. But those kinds of issues don't manifest anymore once you have done this kind of change point upweighting.

Okay, with that, I think I'm gonna skip the second of the two things I was gonna talk about and instead go to a conclusion slide, to summarize broader learnings for applications of deep machine learning. So we've only seen one application — this slide was originally intended for the case where we got through both. But you can apply supervised learning to sequential decision-making tasks. The one example that we've seen is imitation learning through behavior cloning; I was gonna show you a different example of model learning, but we haven't quite gotten to that. The important thing to keep in mind is that the supervised learning objective in each case — in the imitation learning case in particular that you've seen — does not directly maximize the reward. It instead optimizes something that's not perfectly aligned with the reward. It's kind of aligned, and like we discussed, it might even be the case that the zero-error solution to the supervised learning objective is also the optimal solution for the reward. But it doesn't really tell you how to distribute your errors, and that's important. Distributing your errors well means that you are able to figure out which samples in your training data are actually important. And that's really the broader intuition for the solution that we developed. So the fix was to reformulate the supervised learning objective to focus on the important instants in time. That's the broad solution strategy that we used for the project that I showed you, but we've also had success using it in other cases.

Okay, so, broader lessons for applying deep ML approaches, in the last couple of minutes. Lesson one: there isn't always readily available supervision for a task. For example, for the sequential decision-making task, it's not always the case that you have access to expert data. And even when there is supervision, it might not be perfectly aligned with what you actually want your machine learning model to achieve. So that's something important to keep in mind. Second, it's not just the case that your supervision might not be aligned; your loss function itself might not be aligned with what you actually want your model to achieve. In fact, in your choice of objective function, you're always limited by certain constraints. You want your objective function to be suitable for gradient descent, for example; you want it to be differentiable. But that might not be what your true performance measure is. Maybe your true performance measure does not satisfy that constraint of being differentiable, but you somehow have to get around it; you have to come up with a proxy for the true reward measure that you can actually optimize. Or maybe you don't even have a single scalar performance measure; maybe you have a performance measure that consists of five different things with some constraints on each of them, and so on and so forth.
It's non-trivial to formulate something like that as an objective function for your supervised learning problem. So you almost always end up with something that's kind of a compromise when you define your objective function, for any interesting problem. And so you should always try to evaluate on the true objective, or the true objectives, or as close to the true objectives as you can get. For example, in our driving problem, the true objective would be the real driving performance, which we measured using a bunch of different metrics: how far along it got on the road, how many times you had to intervene on the task, and so on and so forth. That's really the eventual measure of goodness of your machine learning model, not just your loss function.

Lesson number three is that if your deep learning model can get away with finding degenerate or trivially simple solutions that do almost as well or just as well as a good solution, then it will almost always find the trivial solution. So any time you train your machine learning model on a dataset, if your data has some pathology — maybe it's from a very narrow part of the distribution — or if your objective function has some pathology that permits a very trivial solution, then you can almost be guaranteed that the solution your model eventually finds will not be the solution you're looking for; it will be some kind of weird shortcut solution. And a really great way to figure out whether your model is learning the right thing is to try to evaluate it outside of the distribution that you're training on. So if you can create samples that are outside of the distribution — it might not be applicable to every task, but for example, for our driving task, it was easy to create out-of-distribution samples that you were still interested in doing well on — and you evaluate on those out-of-distribution samples, you can easily figure out whether your model in fact learned the right thing, or just ended up learning some spurious correlation that works on the training distribution. Any questions about that?

Okay, lesson four — or actually, I think this is the penultimate lesson — is that you have an important tool in your arsenal: if you happen to know some way of deducing which is the important stuff in your data, then you can design your loss function to focus on that important stuff somehow. This can be as simple as saying, when the dataset is given to me, I happen to know that samples number 5 and 18 are the most important samples in my training dataset, and I just have to upweight them. Or sometimes you have to do a little bit of extra work, like we did, to automatically identify, by studying the problem, what the points of failure are and why those points might correspond to important change points, and then start upweighting them.

And finally, lesson number five that I want to leave you with is that you cannot really use deep supervised machine learning as a black box, even though it's very tempting to do so. The language of deep learning very often says: here is a method where you don't have to hand-engineer any features at all, you just throw in your inputs, and you have this multi-layer deep neural network that will automatically discover representations and so on. Really, all of the time, we're injecting some structure that we know about the problem into our deep learning algorithms.
So even the fact that we're using convolutions, like somebody brought up earlier, has to do with something we know about images, the domain of images, and their invariances and equivariances. But it's important to go even beyond that — not just in the architecture, but also in the loss function and in the training data. All of these are design choices that you're making, where you have the ability to inject some knowledge about the problem that you're studying, and it can often give you important gains. So for example, for figuring out what is important in our setting, we had to go back, look at the pathology that was emerging in the solutions we were learning the naive way, figure out that the thing going wrong had to do with copying the previous actions, and then say that we have ways to identify that, and then start using that as input when we design our loss functions. So you can think of the real input to your machine learning algorithm as not just being a labeled data set that you throw into a black box; you also somehow inject some intuition that you have about your problem as input to the algorithm.

Okay, with that, I think that's probably the cue for me to start wrapping up — people are coming in for the next class. Yeah, thank you all, and I'm happy to talk about any questions that you might have. I'll stay around here for a couple of minutes. Did I actually get to all the questions on the chat? Yeah, just one more. Okay, let me see if I can quickly. How do we identify it? So very kind of