Welcome to my channel! Today you are going to be learning about model-predictive policy learning with uncertainty regularization for driving in dense traffic. I know this is a mouthful of a title, but you'll be able to understand every single part of it by the end of this talk. This is work done by Mikael Henaff, myself, Alfredo Canziani, under the supervision of Yann LeCun from NYU Courant. You can find me on Twitter under @alfcnz.

Let's start by seeing what a policy does. Here we can see an agent, represented by a pink drawing of a brain, which performs an action a_t given that it finds itself in a state s_t. After performing a specific action, it observes a consequence, or cost, c_t. The agent interacts with an environment, here represented by a picture of the world. The real world tells us what the next state is going to be, given that we start in a specific initial state and the action a_t is performed. Moreover, it provides us the cost associated with a specific action. A policy defines the behavior of an agent at a given point in time.

When we learn a policy by interacting with the real world, we talk about model-free reinforcement learning. Model-free reinforcement learning is very effective, but it needs lots of environment interactions to learn from, which can be slow, expensive, or dangerous: think about trying to learn to drive a car in the real world. Model-based reinforcement learning instead learns a policy by interacting with a world model, here represented by a blue drawing of the world.

In order to train this world model, we simply try to copy the real world, here represented with a picture of the Earth. We start by sharing the initial state s_t, with t = 0. We then provide the system the action a_t, performed perhaps by an expert, and then we observe how the future unfolds: we get the next state s_{t+1} and a given cost c_{t+1}. All the world model tries to do is come up with predictions ŝ_{t+1} and ĉ_{t+1}, and we try to make those predictions as close as possible to what the real world shows us.

Sometimes the cost is an intrinsic and differentiable objective defined on a specific state. In that case we can take it out of the equation, and we just have to learn the next state s_{t+1}; then we talk about model predictive control. We still have an agent, which performs an action a_t given that it finds itself in a state s_t, and on the other side it interacts with a world model. We define an intrinsic and differentiable objective as cost, which here provides us the ĉ_t.

We have talked about three elements so far: the state s_t, the action a_t, and the cost c_t. Let's see in more detail what these are in the case of autonomous driving. We have seven cameras mounted on top of a 30-story building facing a highway; here are two pictures of what the camera views look like. We apply a viewpoint transformation to fix the perspective, and we have tracking and identification of each vehicle. The same vehicles appear in both representations, as shown here by the yellow arrows. We have a third representation, which uses the simulator I wrote; notice how the same vehicles appear in this representation as well. For every vehicle we have a vector p_t representing the x and y position and a vector v_t representing the x and y velocities.
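As a quick aside, the world-model training just described fits in a few lines of PyTorch, the framework this project uses. This is only a minimal sketch under my own assumptions: the network, the dimensions, and the plain MSE objective are placeholders to make the idea concrete, not the project's actual architecture.

import torch
import torch.nn as nn

class WorldModel(nn.Module):
    # Placeholder network: consumes the current state and action,
    # predicts the next state. The real model is convolutional and
    # also consumes a history of states.
    def __init__(self, state_dim=4, action_dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s_t, a_t):
        return self.net(torch.cat([s_t, a_t], dim=-1))  # s-hat_{t+1}

model = WorldModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def training_step(s_t, a_t, s_next):
    # Copy the real world: push the prediction towards the observed
    # next state s_{t+1}.
    s_hat = model(s_t, a_t)
    loss = nn.functional.mse_loss(s_hat, s_next)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Now, back to the state representation.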
Moreover, we have another vector a_t representing the acceleration in the longitudinal and in the transverse direction, so x and y. We then have a final representation used for training our networks: every vehicle is represented by a green box, the lanes are red, and the background is black. Notice how the same vehicles are mapped down to this last representation. For every vehicle I crop a rectangle showing the context around that vehicle: it shows the lanes and the markings, and it also shows the traffic situation of the other vehicles around me. Moreover, I place myself on a different level, the blue channel, so here the central car becomes blue, and the bus is shown in the upper right corner. These crops are my images i_t at time t. So, finally, the set of positions p_t, velocities v_t, and images i_t represents my state s_t, whereas a_t, my x and y accelerations, represents my action.

We have two costs: a lane cost and a proximity cost. Let's start with the lane cost. Here I'm showing our own vehicle in blue, aligned with the x-axis. We have a potential for the lanes, so that when we plug this into our reference on the map, the intersection of the lane with the potential is non-zero whenever the car is not in the center of the lane. We take this value as the cost associated with the lane.

For the proximity cost we do something similar, but relative to the closest car. We have again our blue vehicle, which is ourselves, and we have a potential in the y direction and a potential in the x direction. While the potential in the y direction is fixed, the potential in the x direction changes with the velocity of our vehicle. If we plug this into the simulator, we see that the car has a y potential and an x potential, and by taking the product of the two we come up with the proximity cost. If there are multiple vehicles, we just pick the maximum.

Let's see how we can reformulate the proximity cost to make it differentiable. Here I'm showing a situation where we are moving at around 20 km/h on the left-hand side, whereas on the right-hand side we move at 50 km/h; you can see how the vehicles are more distant from one another on the right-hand side. On the left-hand side, if I multiply those two potentials, I get the following image, and the same goes for the right-hand side. As I said before, the potential in the x direction changes with the speed, so when we go at 50 km/h our potential is much longer. To get the final proximity cost, we multiply the potential map pixel by pixel with the green channel and then take the max. The nice thing here is that this is differentiable: we can use it directly with our neural network and train the network to minimize this cost.

So let's see now how we train the world model, predicting what's next given history and an action. The world model gets as input a sequence of states s_{1:t}, where each state is the combination of a 2D vector p_t for the position, a 2D vector v_t for the velocity, and the context image i_t representing both the traffic situation and the lanes surrounding us. Moreover, it also gets the action a_t, perhaps performed by an expert, and it tries to tell us what happens next, ŝ_{t+1}, where the hat denotes a prediction.
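One quick sketch before we train the world model: the differentiable proximity cost just described is essentially an outer product of two 1D potentials, masked pixel by pixel with the green (vehicles) channel, followed by a max. Here is how that could look in PyTorch; the shapes, the linear-ramp potentials, and the safe-distance formula are all my own illustrative assumptions, not the exact formulation we used.

import torch

def proximity_cost(green_channel, speed, safe_seconds=1.5):
    # green_channel: (H, W) tensor marking the other vehicles' pixels.
    # speed: ego speed (a float), used to stretch the x potential.
    H, W = green_channel.shape
    ys = torch.arange(H, dtype=torch.float32)
    xs = torch.arange(W, dtype=torch.float32)

    # Fixed potential across the lane (the y direction in the talk):
    # highest at our own column, falling off towards the crop's edges.
    y_pot = torch.clamp(1 - (xs - W / 2).abs() / (W / 2), min=0)

    # Speed-dependent potential along the road (the x direction):
    # the faster we drive, the longer the ramp, so cars further away
    # start contributing to the cost.
    safe_dist = max(speed * safe_seconds, 1.0)
    x_pot = torch.clamp(1 - (ys - H / 2).abs() / safe_dist, min=0)

    # Outer product gives the 2D potential map; mask it with the
    # vehicles channel and take the max. Every operation here is
    # differentiable, so the cost can be back-propagated through.
    potential = x_pot[:, None] * y_pot[None, :]  # (H, W)
    return (potential * green_channel).max()

With the costs in hand, back to the world model and its targets.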
On the other side we have the real world, which gives us real observations of the real future: these are our targets s_{t+1}.

Let's start by trying to learn the evolution of the future with a deterministic predictive network. We have our states and our action, which are fed into a predictor; then we have a decoder, and we get the prediction ŝ_{t+1}. On the other side we have the targets s_{t+1}, and we try to minimize an MSE loss between the two. Here on the left-hand side we can see the evolution of a real future, and on the right-hand side we see how our deterministic model comes up with its own evolution of the future. The frame rate is 10 Hz. We are now 5 seconds into the future... 10 seconds into the future, and everything is just a mess. If we keep going, things get very blurred out, and especially when we try to turn a little bit, everything just falls apart.

So what's the main problem here? We have to keep in mind that the future is multimodal. Think about having a pencil standing on this plane, with a top-down view on the right-hand side. I'm going to make it fall several times, and the pencil will fall in different directions. If I try to learn this multimodal future with a deterministic, unimodal network, the network will say, on average, "well, it never fell", and we know this to be wrong. So we cannot use a unimodal network to learn a multimodal future.

We need to change something, so here we introduce the variational predictive network. So far I haven't changed anything: this network, as we said, gives us those blurry predictions. To the hidden state here I'm going to add something: a low-dimensional latent variable, which I expand in order to match the dimensionality. During training, the latent variable is computed in the following manner: we send the actual target s_{t+1} to an encoder, which tells me the mean and the variance of a Gaussian distribution from which we are going to sample. This is very similar to what is done in a variational autoencoder; if you are not familiar with such networks, check out my lessons about variational autoencoders, the link is posted below. These two blocks represent the posterior distribution q_φ, a probability density function over z given my history s_{1:t} and the actual future s_{t+1}. We enforce some normality by enforcing closeness between the prior distribution and the posterior distribution with a KL term. So in this case we have to minimize two losses: one is the MSE and the other is the KL.

This is how we train the network; let's see how we actually use it when we don't have access to the target. Here the low-dimensional latent variable, in green, is simply sampled from the prior distribution. We get the next prediction ŝ_{t+1} and feed it back to the input, we sample a new low-dimensional latent variable, and we get ŝ_{t+2}. We put that back in the input, and so on: we obtain a sequence of predictions that unrolls one potential future.

So let's see how it looks. On the left-hand side we have the future which actually happened. On the right-hand side we can see four different evolutions of the future: here we have sampled four sequences of 200 latent variables from the prior distribution.
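As a side note, the train-versus-inference sampling logic just described can be sketched as follows. Again this is PyTorch, and again every module here is a placeholder of mine, with linear layers standing in for the real convolutional encoder, decoder, and predictor:

import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalPredictor(nn.Module):
    def __init__(self, state_dim=64, action_dim=2, z_dim=8):
        super().__init__()
        self.encoder = nn.Linear(state_dim, 2 * z_dim)  # posterior head
        self.predictor = nn.Linear(state_dim + action_dim + z_dim,
                                   state_dim)
        self.z_dim = z_dim

    def forward(self, s_t, a_t, s_next=None):
        if s_next is not None:
            # Training: sample z from the posterior q(z | target) with
            # the reparameterization trick, as in a variational
            # autoencoder, and keep a KL term towards the prior.
            mu, logvar = self.encoder(s_next).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
            kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)
        else:
            # Inference: no target available, so sample z from the prior.
            z = torch.randn(s_t.size(0), self.z_dim)
            kl = None
        s_hat = self.predictor(torch.cat([s_t, a_t, z], dim=-1))
        return s_hat, kl

def loss_fn(model, s_t, a_t, s_next, beta=1.0):
    # The two losses mentioned above: MSE reconstruction plus KL.
    s_hat, kl = model(s_t, a_t, s_next)
    return F.mse_loss(s_hat, s_next) + beta * kl.mean()

Back to the four sampled futures.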
Pay attention to the car next to us in the white circle and the car behind it in the white square: you can see how differently the network predicts their evolution. None of the samples exactly repeats the actual future; they show four different variants of it. And if we compare these results to the deterministic model, we can see that they are crisp, and quite realistic as well.

After training this network, we noticed a first problem: the network is insensitive to the action. The action's effects are actually encoded in the latent variable by the encoder. This is a problem because the network then already knows what the action's consequences are, and therefore it pays no attention to the current action fed to the predictor. Here we observe a four-times sped-up version of reality, with the real sequence of latent variables z and the real sequence of actions; we can notice how the driver steers a few times, and we show it twice. On the left-hand side we now see two different samples of latent variables with the real sequence of actions: the car should be steering the same way, but it does not. Finally, here we have the real sequence of latent variables but a different sequence of actions, and unfortunately we can notice that the steering behavior is encoded in the latent variable rather than in the action sequence.

The fix is to interrupt this flow of information from the encoder network to the output. We do this by applying a dropout to the latent variable during training of the forward model: the low-dimensional latent variable is not always sampled from the posterior distribution, but is sometimes sampled from the prior distribution instead. In this way we regain sensitivity to the current action fed to the predictive model. As we can notice here on the right-hand side, we have two sequences of latent variables, z1 and z2, with the real sequence of actions, and the steering behavior of reality is now matched by both. (A short code sketch of this latent dropout appears a bit further down.)

Great: we now have a way to predict a multimodal future, using the latent variable as an internal switch which allows our main deterministic network to learn a specific mode and act in a specific mode. So let's now see how we can learn the agent and distill a policy. Here the agent is represented by the letter π, which stands for policy. All a policy does is map an input state, for example a sequence of states s_{1:t}, where every state is represented by a 2D vector p_t for the position, a 2D vector v_t for the velocity, and a context image i_t, to an action â_t, which represents acceleration or brake and right or left steering, i.e. the controls.

Let's first train the agent in a naive way. We have our policy, which maps the input state to a given action, and then we have the world model. The world model needs to be fed a sample of the low-dimensional latent variable, and it then produces a prediction ŝ_{t+1}. From this prediction we compute the loss, and my loss in this case is simply the task cost we defined before: the proximity cost plus some λ_lane times the cost associated with the lane. And then what? Then we keep going.
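Here is the promised sketch of the latent dropout. During world-model training, with some probability we throw the posterior sample away and use a prior sample instead, cutting the information path from the target through z; the probability value and the function shape are my own illustrative choices.

import torch

def sample_latent(mu, logvar, p_dropout=0.5, training=True):
    # Posterior sample, as in the variational predictor above.
    z_posterior = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
    if not training:
        return z_posterior
    # Latent dropout: with probability p_dropout, replace the posterior
    # sample with a prior sample. The predictor can then no longer rely
    # on z to smuggle in the action's consequences, so it has to
    # actually read the action input.
    if torch.rand(()) < p_dropout:
        return torch.randn_like(z_posterior)
    return z_posterior

Now, back to unrolling the policy through the world model.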
We feed the prediction ŝ_{t+1} to the policy again, which gives us the next action â_{t+1}. Once more we provide ŝ_{t+1} and the new action to the world model, we sample a new low-dimensional latent variable from the prior distribution, and we get the prediction ŝ_{t+2}. We apply our loss again and we keep going: we feed this to the policy, we get the next action â_{t+2}, we provide both ŝ_{t+2} and â_{t+2} to the world model, we sample a new low-dimensional latent variable, and we finally get the last prediction ŝ_{t+3}, on which we compute the loss. Now we can just run back-propagation, which is very similar to how back-propagation through time works: we have unrolled a sequence of states and actions representing a trajectory, and we run back-propagation through the whole trajectory, from the future back to the initial conditions.

Does it work? Nope. What happens is that, while we are trying to minimize the cost, the network cheats by coming up with corrupted state predictions that minimize the cost, for example by setting all pixels to zero. We are falling outside the data manifold. We can notice here how we end up crashing outside the street or, in the last case, against other vehicles.

So let's see how we can fix this; we can, for example, try to imitate the expert we observe in our data. Expert imitation: what do we do? Our loss is now defined as the task cost C_task, which was the sum of the proximity and lane costs, plus some expert regularization. In this case we try to match what happened in the past, so we also minimize the distance between my targets and my predictions. Here it works better if we do not use the latent variable, because that gives us the prediction which is as close as possible to the actual target. Does this work? It does, actually: we can see from the yellow trajectories that we stay within the lanes and we do not fall outside the manifold.

Are there other, more generic ways? Yes: we can try to find a different manifold attractor, so that we avoid falling outside the manifold. We start with our world model, which is fed the sequence of states and the action and produces the prediction ŝ_{t+1}. From here we can compute the task cost and obtain c_{t+1}. Here the true cost is represented with the green dashed line and the training samples with red dots. If we go outside the training samples, the predictions of multiple networks trained on the same data but initialized differently are arbitrary and no longer represent the training function. To avoid going into these regions, notice that the variance between the different predictions there is very high: minimizing this variance in our cost leads our agent to stay in the comfort zone, meaning the region where data has actually been observed. So we can introduce our regularizer, the uncertainty regularizer.

How do we compute this uncertainty? The initial sequence of states s_{1:t} is fed to the controller, our policy, which comes up with an action â_t. These two are fed to the world model, we sample a latent variable z_t, and we therefore come up with our prediction ŝ_{t+1}.
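Putting the pieces together, here is a hedged sketch of this policy-training loop: the unrolling described above, the task cost, and the dropout-based uncertainty term whose computation is spelled out right after this snippet. Every name, shape, and constant is an assumption of mine; policy, world_model, and task_cost are placeholders, task_cost being the differentiable proximity-plus-lane cost from earlier.

import torch

def train_policy_step(policy, world_model, task_cost, s_hist,
                      unroll=3, n_masks=10, lam_u=0.5):
    # The world model is frozen; only the policy receives gradients.
    total = 0.0
    for _ in range(unroll):
        a_hat = policy(s_hist)
        z = torch.randn(world_model.z_dim)             # prior sample
        s_hat = world_model(s_hist, a_hat, z)
        total = total + task_cost(s_hat)

        # Uncertainty regularizer (detailed just below): keep the world
        # model's dropout active, run it several times with different
        # masks, and penalize the variance of the resulting costs.
        world_model.train()                            # dropout on
        costs = torch.stack([task_cost(world_model(s_hist, a_hat, z))
                             for _ in range(n_masks)])
        total = total + lam_u * costs.var()
        world_model.eval()

        s_hist = torch.cat([s_hist[1:], s_hat[None]])  # slide the window

    total.backward()  # back-propagation through the whole trajectory,
                      # from the last prediction back to the first action
    return float(total)

Now let's finish spelling out that uncertainty computation.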
This prediction is fed to the task cost, which is the linear combination of the proximity and the lane costs. Inside the world model we have some dropout modules, and we keep them turned on at evaluation time. We therefore end up with d different dropout masks, which produce d different predictions, s̃_{t+1}, from 1 to d. We compute the variance across these, multiply it by λ, and this is our model-uncertainty regularizer. We add it to the original task cost, and this gives us our new objective function to minimize. So all we need to do here is define our loss as our task cost C_task plus some λ multiplied by our uncertainty regularizer. And this just works.

So how do we evaluate this policy? We take one car, the yellow one in this case, and replace its observed trajectory with our own policy. Yellow represents the original car, blue represents our own car, which is driven by our controller, and green are all the other cars, which are blind to us because they are replaying their original trajectories. We can notice now, as we drive, that the yellow car actually chose a different trajectory: we are stuck in a situation where no one sees us, and we still have to survive until the end of the screen. In the second case, we are in front of the yellow car, just in between the preceding and the following vehicle, and now we are speeding up because there is no one in front of us. In the third case we again choose a different path, because we are pushed around by the other vehicles; now we are on our own, no one sees us, and we still have to try to survive this dense-traffic scenario.

I promised you would understand every term in the title. The main contributions are: the uncertainty regularization; the latent dropout for improving the action sensitivity; the large-scale data set of driving behavior from a traffic camera; and, finally, the option of additionally copying the past using the MPER, the expert regularization, which is basically a model-based imitation-learning approach.

Finally, some info about this project. We've been talking about Prediction and Policy learning Under Uncertainty, or PPUU. The speaker was me, Alfredo Canziani, and again you can find me on Twitter under @alfcnz if you'd like to keep up with all the work I'm doing in machine learning. This project was done in collaboration with Mikael Henaff and Yann LeCun, both of them on Twitter as well. These slides are available at this link. The article on OpenReview is available at this link, and there is another version on arXiv. The code is available in PyTorch at the following link, the website is the following, and you can also find the poster online. If you liked this lesson, please feel free to like this video, and if you'd like to be kept posted with new material, please subscribe to the channel and enable the notification bell. Thank you again for listening. Bye!