In the last example, we started to move toward causality, building models that let you ask counterfactual questions: not just prediction, but what would happen if. Now, causality, I think, is the holy grail. Causal models generalize; that's why physics works so well, it's always the case. Correlations can change from year to year, but if you have a good causal model, where this causes that to happen, then almost by definition it will generalize to different settings, different cases, different times. The statistics may change if you go from the Wall Street Journal to the New York Times to the Philadelphia Inquirer, the frequencies of words change, but the underlying structure of the syntax should be the same, and so should the underlying motivations that describe why people say things. If it's a causal model, it should generalize.

Secondly, we often want to take actions. We want to do something and ask: what should I do to make this happen? We've seen that in reinforcement learning, which is about learning actions, but we've mostly divorced that from our deep learning, where we learn correlations that may not be causal. Think of the simplest example: I've got a room, it's wintertime, it's Philadelphia. If the heat is on, what's the room temperature? Probably low. If the heat is off, what's the room temperature? Probably high. So which way does the causality run? Is the heat being on causing the room temperature, or is the room temperature causing the heat? The room temperature is causing the heat: the cold temperature of the room causes the thermostat to turn the heat on. We want to know which way we should move these things, because in the end we want a model that learns, hey, here's how to adjust the room temperature. Again, reinforcement learning gives us this, but if we're just observing the world, it's really hard to tease out causality.

Finally, I think that causal models mostly live in a more symbolic or discrete space. They don't live in pixel space; they live in objects-banging-into-each-other space, where there are discrete objects with sizes and masses and positions. And I think that if one wants to learn models that generalize well without too much data, then one has to, in some sense, move toward this more sparse or symbolic representation. One of the many approaches to do that is called recurrent independent mechanisms, yet another approach to try to drive modularity into systems. So imagine, again, something where you're showing your computer a sequence of visual inputs as it watches the world. It's going to learn in its neural net a hidden state split into a set of modules, and the trick is that from one time step to the next there is only sparse communication. There's an attention mechanism, learned, of course, through gradient descent, and the attention mechanism says you only get to pass on information from a small number of these modules in your hidden state to the next time step, time t plus one, along with the next input. So this is a very particular kind of recurrent neural net: rather than full connectivity, we're enforcing sparsity. And the sparsity tends to produce models that generalize better, that do less overfitting to the particular correlation structure they were trained on. And we sort of hope that maybe we'll learn things that are more causal.
They certainly learn things that generalize better, at least in the few experiments that have been done so far.
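To make the sparse-communication idea a little more concrete, here is a minimal sketch of a RIM-style recurrent step in PyTorch. This is not the published implementation: in the real architecture each module's hidden state generates attention queries over the input (plus a null input), and there is a second attention step for communication among the active modules, whereas this sketch collapses all of that into a single learned score that picks the top-k modules allowed to update at each time step. All names, sizes, and the choice of GRU cells here are illustrative assumptions.

```python
# Minimal sketch of sparse, modular recurrence in the spirit of
# Recurrent Independent Mechanisms: several small recurrent modules,
# but only the top-k of them (scored against the current input) get
# to update their state at each time step. Illustrative only.

import torch
import torch.nn as nn


class SparseModularRNN(nn.Module):
    def __init__(self, input_dim, hidden_dim, n_modules=6, k_active=2):
        super().__init__()
        self.n_modules = n_modules
        self.k_active = k_active
        # One small recurrent cell per module, each with its own hidden state.
        self.cells = nn.ModuleList(
            [nn.GRUCell(input_dim, hidden_dim) for _ in range(n_modules)]
        )
        # A learned score: how relevant is the current input to each module?
        self.input_scores = nn.Linear(input_dim, n_modules)

    def forward(self, x_seq):
        # x_seq: (seq_len, batch, input_dim)
        batch = x_seq.size(1)
        hidden_dim = self.cells[0].hidden_size
        h = [torch.zeros(batch, hidden_dim) for _ in range(self.n_modules)]
        for x_t in x_seq:
            # Score modules against the current input and keep only the top-k.
            scores = self.input_scores(x_t)                    # (batch, n_modules)
            topk = scores.topk(self.k_active, dim=-1).indices
            active = torch.zeros_like(scores).scatter_(1, topk, 1.0)
            new_h = []
            for m, cell in enumerate(self.cells):
                updated = cell(x_t, h[m])
                # Inactive modules keep their previous state: sparse updates.
                mask = active[:, m].unsqueeze(-1)
                new_h.append(mask * updated + (1.0 - mask) * h[m])
            h = new_h
        return torch.stack(h, dim=1)  # (batch, n_modules, hidden_dim)


# Example usage on random data, just to show the shapes.
rnn = SparseModularRNN(input_dim=8, hidden_dim=16)
out = rnn(torch.randn(5, 3, 8))   # 5 time steps, batch of 3
print(out.shape)                  # torch.Size([3, 6, 16])
```

The point of the hard top-k mask is the one made above: at every step most modules keep their previous state untouched, so information is forced to flow through a small number of channels rather than through a fully connected hidden state that can soak up whatever correlations happen to be in the training data.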