I'm going to share some thoughts about where I see challenges for current research in AI, machine learning, and deep learning, especially from the agent perspective and through questions of causality, which I think deep learning will have to embrace in order to move to the next stage.

If we look at the limitations of current approaches, one is sample complexity, in other words, how many examples we need to learn a particular task. This is true for current supervised learning, but it's even more true for reinforcement learning systems, if you compare with how many examples a human needs to learn a new game, for example, or a child needs to learn a new task. There are also practical reasons to tackle this if we're going to build things like robots or other machines that experience the real world, where we don't have the luxury of a huge number of examples because each of them can be dangerous, lethal, or costly. And in the real world, unfortunately, we don't have a perfect simulator. We don't have a simulator of humans, for example, with which we could train RL systems to do dialogue, which we would all dream to have.

Another issue with current approaches is that a lot of the high-level concepts and knowledge about the world is provided by humans who label data. We have yet to achieve the dream that my colleagues and I put forward a few years ago: deep learning systems that discover by themselves the high-level abstractions of the kind we communicate with language. Also, if you look at the kinds of mistakes that current deep learning systems make, whether on images, text, or anything else, you'll see that when they work, they work well, but when they fail, they fail in ways very unlike humans, ways that reveal the fairly superficial cues they exploit rather than the kind of high-level abstractions humans use. So that's the motivation.

If we want to dig a little deeper into where we are versus where we would like to go, one really nice division is what psychologists call system one and system two tasks. There are things we do very quickly and intuitively; these are system one tasks, like perception. We perform this very complex computation in an unconscious way; in other words, we can't explain to a machine how we do it, which is why deep learning and machine learning have been so useful for these kinds of tasks. But there are other tasks, which classical AI tried to achieve, that we do consciously: things like reasoning, programming, building algorithms. These tasks are slow, logical, and sequential, and current machine learning doesn't address them as well as I think it should. So one of the areas I want to talk about, which I think would really help us bridge the gap, is not to try to do the system two tasks by themselves, but to combine the strengths of both sides, with things like grounded language learning, where we try to learn the meaning of sentences in a way that allows a system to connect those words to what they refer to in some kind of model of the world. So now let me tell you about the agent perspective for deep learning.
Of course, there's deep reinforcement learning, which has been very successful in games, but mostly that was about using deep learning as a black box to solve some of the generalization problems you find in reinforcement learning. So reinforcement learning people use deep learning, and it helps solve reinforcement learning problems like playing games. What I'm talking about here goes in the other direction: how does the perspective of a learning agent, which can act in its environment, change the way we design representation learning machines, deep learning systems that are supposed to be discovering good representations?

For one, once you take the agent perspective, you have to move away from the classical machine learning framework of IID data and a fixed, given data distribution. Instead, you have agents which interact with their environment and can change it, which means they change the distribution from which they're learning. That could be troublesome, but, and this is one of the big messages of my talk today, it's actually a way to bring a lot of information to the learner. These changes in distribution, which up to now we've considered a hindrance, can in fact teach the learner how the world works. The changes arise because agents do things, maybe the learner itself, maybe other agents; if we want machines that interact with humans, those humans will be doing things. And we know that for a lot of machine learning systems we build in labs, when we bring them to the real world where the distribution is a bit different, there's a loss in performance. It's difficult to generalize out of distribution; in fact, all our theory breaks down. So one way to deal with this, I think, is to embrace the challenge by building machines that not only model the particular data you give them, but also try to figure out the underlying causal structure, the understanding of how the world works that is behind that data.

Furthermore, the agent perspective gives us something really nice: the possibility for the learner, as in active learning, to purposely go after knowledge. This is one of the things I'm most excited about. I won't have time to talk about it, but all the work going on right now in what you could call unsupervised reinforcement learning, where agents explore the world in order to acquire knowledge, is, I think, going to be crucial in the future.

Let me add a particular element which again touches on practical applications. A few years ago I was considering the problem of training autonomous vehicles, and one of the issues that comes up is that you have rare situations, for example accident situations, that can matter a lot, and we don't have enough data about them. The systems we build today have difficulty generalizing to these very rare cases, which are unlike the normal training data they get. How do humans get around this problem? Well, even though I've never been in a really serious accident, and I'm really glad that's the case, I can imagine these things. We have this capacity for imagining situations we have never encountered, maybe even impossible ones: we read fiction and science fiction, we go to science fiction movies, and we can imagine these impossible scenarios.
So this ability to imagine counterfactuals, in the context of causality, is something I think we need to build into our machine learning systems, for a lot of good reasons.

In the last year, one of the threads connected to what I've been talking about, which I want to mention briefly, is a change of perspective on generative models. I'm talking about imagination, and we've been very successful with things like GANs and variational autoencoders at building systems that can generate images, and now speech and so on. But this is not the kind of imagination we have. The kind of imagination we have doesn't produce pixels; it produces fuzzy, abstract images in our mind. That's the kind of imagination we need. So we've been investigating machine learning methods, representation learning methods, that allow us to generate, and to learn features, at the level of these unobserved latent variables.

One of the tools we've been working on, which I find really interesting, is GAN-related, adversarial methods to estimate and maximize mutual information. Why is that relevant? Well, suppose you want to predict something that's going to happen in the future, but in the latent space. If you were to just use maximum likelihood or something like it, then when you backprop into the encoder that maps the low-level data to your representation, the encoder could simply learn to produce representations that are constant, because constants are easy to predict. So the usual objective functions are not quite appropriate if you want to make predictions in latent space; they would collapse to something bad. There may be several ways of dealing with this, but one of the approaches we find most exciting is, instead of thinking of it as a prediction task, to think of it as maximizing the mutual information between past representations and future representations. Because that also maximizes the entropy of the representations, it prevents this collapse and does the right thing (I'll show a small sketch of this kind of objective below). So we've been working on these kinds of methods, looking at ways to maximize mutual information between parts of the high-level representation at particular spatial locations and their counterparts at different times, and between those localized features and global features, and we're making progress in learning, in an unsupervised way, good features for reinforcement learning tasks in particular.

Now, let me focus a little more on these latent representations that we would like our machines to imagine with. If we do a little introspection on our own imagination, as I said, it's not just that we imagine abstract things; it's also that what we have in our mind when we imagine isn't like a movie of the future. Not only is it not at the pixel level, it's also focused on just a few aspects of the world at a time. If you consider your thoughts, at any particular moment they focus on a few aspects of the world. When you project yourself into the future, maybe you're thinking about a car coming on your left, and you're not thinking about a zillion other things which could happen, might happen, will happen. You're only thinking about a few things that matter for your current decisions.
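To make the collapse argument above concrete, here is a minimal sketch. It is not the method from any of our papers: the encoder is just a linear map, the learned adversarial critic is replaced by a simple dot-product score in an InfoNCE-style contrastive objective, and the names (encode, infonce_loss) are placeholders. The point is only that an encoder which collapses to a constant is stuck at chance level under such an objective, whereas an encoder that keeps information linking past and future does well.

```python
# Minimal sketch: why a contrastive, mutual-information-style objective avoids
# the "constant representation" collapse when predicting in latent space.
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    """Toy encoder: a linear map standing in for a deep network."""
    return x @ W

def infonce_loss(z_past, z_future):
    """Each past latent must identify its own future latent among all futures
    in the batch; maximizing this agreement lower-bounds I(z_past; z_future)."""
    scores = z_past @ z_future.T                  # (batch, batch) similarity matrix
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # positive pairs are on the diagonal

batch, x_dim, z_dim = 32, 20, 4
W = rng.normal(size=(x_dim, z_dim))
x_past = rng.normal(size=(batch, x_dim))
x_future = x_past + 0.1 * rng.normal(size=(batch, x_dim))  # correlated future

informative = infonce_loss(encode(x_past, W), encode(x_future, W))
collapsed   = infonce_loss(np.zeros((batch, z_dim)), np.zeros((batch, z_dim)))
print(informative, collapsed)  # collapsed encoder is stuck at chance level, log(batch)
```

In practice the score would come from a trained critic and the encoder would be updated by backprop, but the principle is the same: the objective rewards representations whose past and future share information, so collapsing to a constant is no longer an attractive solution.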
So this kind of focus on a few aspects in your imagination and projection is very different from the usual machine learning idea of predicting the full distribution at the next time step. This is a direction I introduced a couple of years ago in a paper called The Consciousness Prior, which I also talked about here last year. What we're doing now is seeing how we can use these ideas to provide better priors on what the high-level representations should look like, and how we could use this for things like planning.

So, if we're going to select a few dimensions of the high-level state on which to focus computation, we need attention mechanisms. And it turns out that one of the greatest, and I think undervalued, advances in machine learning in the last few years is the development of attention mechanisms. In fact, if you look at state-of-the-art NLP systems today, almost all of them use attention. These attention mechanisms basically allow us to focus computation on a few things at a time. That's it. And you can see how this is central to the idea of system two computation I was telling you about, where we sequentially focus on a few things that matter. So it's closely related to this Consciousness Prior, but also to the general task of reasoning and so on. And because these attention mechanisms are soft attention, a graded attention over many things, you can still use backprop to train them.

There's another thing that's really important about attention mechanisms: they change the nature of what neural nets are doing. Classical neural nets are vector processing machines; even images are seen as big vectors with a topology. Once you introduce an attention mechanism, you can process sets. If I can focus my attention on a few elements, that means I can select them from a set, and I can now process sets, generate sets, and transform sets. This is what transformers are about. So working on sets, working on objects again, makes a lot of sense for the kind of high-level processing we want to achieve.

In the Consciousness Prior idea, one of the proposals, in terms of architecture, in terms of how we compute and do inference with these ideas, is that in addition to the usual mapping from the input to some high-level representation, which I'm now going to call the unconscious state, there is an attention mechanism which selects from the unconscious state a few relevant elements, a few objects from that very large set, to produce a small set that is going to be the current conscious thought, or maybe an imagined state of the world in the future (I'll show a tiny sketch of this kind of soft selection below). So this conscious state is very small. It's the kind of thing you might express in a single sentence. And there would be a lot of connection with natural language understanding, because the objects you manipulate in the conscious state are like the words and concepts we name. Another way to think about this, from a graphical-models perspective, from a sort of abstract probabilistic perspective, is that because we focus on a few variables at a time, the only kinds of dependencies we can easily represent involve very few variables at a time.
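Before getting to that graphical-model view, here is a minimal sketch of the selection step I just described. This is my own simplification, not the implementation from the Consciousness Prior paper: the "unconscious state" is just a set of vectors, the query is a stand-in for whatever drives attention, and a single soft-attention read produces the small "conscious" summary.

```python
# Minimal sketch: soft attention reading a small "conscious" summary out of a
# large set of "unconscious" elements; differentiable, so trainable by backprop.
import numpy as np

rng = np.random.default_rng(1)

def soft_attention(query, keys, values):
    """Weight each element by softmax similarity to the query, then average."""
    scores = keys @ query / np.sqrt(query.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values, weights

n_elements, d = 1000, 16                 # large unconscious state: a set of vectors
unconscious = rng.normal(size=(n_elements, d))
query = unconscious[42] + 0.01 * rng.normal(size=d)   # "what am I attending to?"

conscious, w = soft_attention(query, unconscious, unconscious)
print(w.argsort()[-3:])   # attention mass concentrates on a handful of elements (incl. 42)
```

The two properties that matter here are that the computation effectively depends on only a handful of elements (the softmax puts almost all of its mass on a few of them), and that the selection stays soft and differentiable, so the whole thing can be trained end to end.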
There's a way to capture that in graphical-model terms: simply say that the joint distribution over this very high-dimensional unconscious state, all of the variables we could focus on, is a sparse factor graph. A factor graph is a graphical model in which the joint is decomposed into a product of potential functions, each of which touches some subset of the variables. You can think of these potential functions as little constraints that tie a few variables together. And sparse here means that each potential function only looks at a few variables rather than all of them at the same time. So that's one very simple way to encapsulate this notion. That sparsity constraint you can think of as a prior: it's a constraint on the kind of variables we want to have at that level. Pixels don't enjoy this property. If you try to capture the joint dependency between pixels, you essentially need to look at almost all of the pixels in the image to get a sensible prediction about one pixel given the others. Whereas with these high-level variables, if I were to drop this, I'm going to be able to catch it, and, well, I wouldn't let it fall on the ground like in the slide. These kinds of statements can be made with very, very high probability. So not only do they involve very few variables, they are strong statements: I can predict things with very high certainty. And that creates a pressure on the kinds of variables we can represent at that level.

Another source of constraints on these high-level representations comes from the agent perspective. Agents can do things in the world. And in a deep learning framework, these agents need to represent not only the state of the world, which is what I've been talking about up to now, but also their policies, their intentions, their goals, which have to do with the kinds of actions they can take. When I said I could drop this, what happened is that I constructed, on the fly, a policy for achieving a goal, which was to drop this with this hand and catch it with that hand. That information was represented in my brain, presumably in some distributed way, and I'd like to know how to represent that kind of intent information. At the same time, I'm representing the objects on which these policies operate, and there should be a relationship between the two. If I have a policy for catching this, that policy should have fairly obvious dimensions that correspond to the object's position, because those are the quantities that matter for that policy. This led to ideas we started on in 2017, and to more recent work, where researchers think about these two spaces: the space of actions, intentions, options, and policies, and the space of states of the world. One nice, simple thing we can say is that we'd like to maximize the mutual information between these two types of representation. If you think about it, the extreme of maximizing mutual information is a one-to-one mapping. In other words, for the vertical position of this object, there is a policy I can create which manipulates it, and for any factor in the world that I can control, there is a policy directly associated with it. Maximizing that information means I can predict one from the other perfectly. So if you see me doing this action, you can predict what my intention was.
And if you ask me to do it, I can find in my head the policy that will do it. So there is a two-way connection between the two.

This is connected to the notion of generalizing beyond the training distribution that I mentioned at the beginning, and to another fundamental limitation of current machine learning, which has to do with how we organize knowledge. I'm going to try to explain how. One of the most exciting advances in deep learning in the last couple of years is the progress in meta-learning. Meta-learning is essentially learning to learn: having an inner loop of optimization and learning, and an outer loop of optimization and learning (I'll show a minimal sketch of that inner and outer loop below). Why is that relevant here, when we're thinking about agents changing the world? Because now we can treat those changes, which give rise to many distributions, many types of relationships between the variables, as different meta-examples. I'll come back to that. But if we want to connect different environments corresponding to different distributions, if we want to be able to generalize to a new environment from the environments we've seen during training, we need to think about the link, the connection, between those environments. If I say nothing about how the world might be in the future, I can't have much confidence that my predictions about the future make sense. It helps, in machine learning, to think about the hypotheses we can make about the world that allow generalization to happen; here it's a kind of generalization across distributions, as the world changes and we consider different environments.

One of the hypotheses I find most interesting is the one Bernhard Schölkopf communicated to me, and which I read in his 2017 book on causality, which I strongly recommend. It's the idea, which actually comes from physics, of independent mechanisms: that we can explain the world as the composition of many small pieces, and that the different pieces, which they call mechanisms, are independent of each other in an information sense, so that what you learn about one doesn't tell you much about another, and if one changes, you don't need to change the others in order to continue modeling the world properly. In our work, we have been building on the idea that most of the changes that happen in the environment, as you go from one environment to another due to the actions of agents or whatever else, are localized: they are small changes when you express the structure of the model in the right way.

Let me illustrate this with a picture. Think of these little circles as parts of a big model, each with its own parameters capturing some aspect of the world, some conditional dependencies, for example. Now something happens that changes the environment: somebody comes and does something. The claim is that, with the right way of dividing our knowledge into pieces, only a few parts of our model need to change to account for what happened. The reason this makes sense is that we are physical agents: we tend to influence the world in the first place at a particular time and place, so our effects are localized. Of course, there may then be chains of consequences. But if we model things in the right way, we can explain the change with very few changes in parameters, or very few inferences, and so we don't need a lot of data to make sense of the change in distribution.
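Since I mentioned meta-learning, here is a minimal sketch of what an inner and outer loop look like. This is my own toy example using a Reptile-style first-order update, not the algorithm from any paper I've mentioned: each "environment" is just a one-parameter regression problem, the inner loop adapts to one environment, and the outer loop moves a shared initialization so that adaptation to new environments becomes fast.

```python
# Minimal meta-learning sketch: inner loop adapts to one environment,
# outer loop (Reptile-style) improves what is shared across environments.
import numpy as np

rng = np.random.default_rng(2)

def inner_adapt(w, x, y, lr=0.1, steps=5):
    """Inner loop: a few gradient steps on one environment's data (MSE of y ~ w*x)."""
    for _ in range(steps):
        grad = np.mean(2 * (w * x - y) * x)
        w = w - lr * grad
    return w

w_meta = 0.0                      # shared initialization ("slow" parameters)
for episode in range(200):        # outer loop over meta-examples (environments)
    w_env = rng.normal(loc=3.0, scale=0.5)    # each environment has its own true parameter
    x = rng.normal(size=20)
    y = w_env * x
    w_adapted = inner_adapt(w_meta, x, y)      # inner loop: adapt to this environment
    w_meta += 0.1 * (w_adapted - w_meta)       # outer loop: Reptile-style meta-update

print(w_meta)   # ends up near 3.0, the centre of the family of environments
```

The only point of the sketch is the two nested loops: the inner loop adapts to one distribution, and the outer loop uses how that adaptation goes to improve whatever is shared across distributions, whether that is an initialization, an architecture, or, as below, a way of factorizing knowledge.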
As a case study to understand this idea of localized change, in a recent paper we looked at something really, really trivial: how does a learner deal with a change in distribution when there is a joint distribution over just two variables, A and B, where maybe one is a cause and the other is the effect, and somebody comes and changes the prior distribution, the marginal distribution, of the cause? It's a very standard, simple scenario. In this scenario, we can compare what happens if you have the right model, which factorizes the joint as P(A) P(B|A), versus the wrong model, which factorizes it as P(B) P(A|B). In terms of representing the joint distribution, it doesn't matter which way you do it. But in terms of what happens when the distribution changes, it completely changes the game: in one case you need to modify all of the parameters to account for the change in the cause, whereas in the other case you only need to change the parameters of the prior P(A) (I'll show a toy sketch of this below). What happens with today's models is that they try to explain the change in the data by modifying all of the weights of a big neural net, and so you need a lot of data to account for those changes. But if we are able to modularize knowledge into smaller pieces that have this sort of independence built in, then I think we could greatly reduce the problems of catastrophic forgetting, poor transfer, domain adaptation, and so on.

One of the things we've also started to look at in this context, which previous work on causality hasn't examined very much, is where the causal variables come from in the first place. Classical work on causality, which is meant to help people in the social sciences, healthcare, and so on, assumes that some scientist is going to give us the variables: there's the smoking variable and the cancer variable, and we can observe them. But of course, for an AI system like a robot, it doesn't work like that. A baby just watches pixels and sounds, and from those low-level signals it has to infer the high-level causal variables. So this is, I think, a new task for causality, one that makes a lot of sense in AI, and one that the kinds of questions I've been asking can help us deal with.

I don't have a lot of time left, so let me skip this. We wrote a paper, which was rejected from NeurIPS, of course. But that's okay; I've got nine others accepted, though this was the best one anyway. It looks at this question and tries to turn changes in distribution from a hindrance into a signal for learning about causal structure and about how to modularize knowledge. Here I want to quote Léon Bottou, one of my friends, who is also working on causality these days (you can see it's a hot topic). He gave a keynote at ICML and he said: nature does not shuffle environments, and by environments he means distributions of data, so we shouldn't either. The reason he said this is that there is really important information in those changes of distribution. For example, in our paper we exploit the mere fact that there was a change in distribution in order to discover whether A causes B or B causes A. More generally, changes in distribution carry information about what is stable across distributions versus what is not, and things like the causal structure tend to be stable.
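Here is a toy numerical sketch of that two-variable case. It is my own illustration rather than the paper's code, assuming both variables are discrete with N values: A causes B, the intervention replaces the marginal P(A) while leaving the mechanism P(B|A) untouched, and we simply check which factors of each factorization have to change.

```python
# Toy sketch: after an intervention on the cause's marginal, the causal
# factorization P(A)P(B|A) updates only P(A); the anti-causal factorization
# P(B)P(A|B) has to update both of its factors.
import numpy as np

rng = np.random.default_rng(3)
N = 5                                            # both variables take N values
p_a   = rng.dirichlet(np.ones(N))                # true P(A)
p_b_a = rng.dirichlet(np.ones(N), size=N)        # true P(B|A), rows indexed by A

def fit(p_a_marginal):
    """Return both factorizations implied by the joint P(A)P(B|A)."""
    joint = p_a_marginal[:, None] * p_b_a        # joint[a, b]
    p_b = joint.sum(axis=0)                      # marginal P(B)
    p_a_b = (joint / p_b).T                      # P(A|B), rows indexed by B
    return p_a_marginal, p_b_a, p_b, p_a_b

before = fit(p_a)
after  = fit(rng.dirichlet(np.ones(N)))          # intervention: new marginal P(A)

for name, b, a in zip(["P(A)", "P(B|A)", "P(B)", "P(A|B)"], before, after):
    print(name, "changed" if not np.allclose(b, a) else "unchanged")
# causal model: only P(A) changed; anti-causal model: P(B) and P(A|B) both changed
```

In the causal factorization only the N-1 free parameters of P(A) move, while in the anti-causal one both P(B) and the whole conditional table P(A|B) move, so a learner using the wrong factorization needs far more data to adapt. The stable pieces, the mechanism P(B|A) and the direction A causes B, are exactly the kind of knowledge that survives changes in distribution.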
So very, very fundamental knowledge about how the world works is going to be invariant to changes in distribution. The set of variables, what I call the causal variables, these high-level variables on which we should be doing the computation, is also something stable. Maybe the values of these variables change, or their marginal distributions change, but which variables matter is, in general, stable. So there is a lot of information that we currently throw away when we take our datasets and do the usual thing of shuffling the data so that we get one IID distribution. We have to change our ways. I'm not going to give you a lecture on meta-learning, but if you don't know about it, you should learn about it, because it's a way we can treat these changes of environment as meta-examples from which we can optimize things like the right way of modularizing our knowledge, which variables are causes and which are effects, and so on. We've been exploring that as well, to detect what the right variables are. But let me move toward my conclusion.

This little piece of work I mentioned at the end is also interesting from a cognitive perspective. One of the questions I've been asking myself for a long time is: how can infants, who cannot act much in the world, who are very passive, who can move their eyes and cry, capture the causal structure of the world? Our experiments suggest that just being a passive observer, but being on the lookout for these changes in distribution, for example when parents do things and the world has changed as a result, can provide information about the causal structure. In addition, if the learner, even a passive learner, tries to infer what those changes were, what happened, what the parent did, then causal inference can be much more efficient. We have another paper, which is going to be submitted to ICLR, where we find that causal learning scales much better if the learner tries to figure out which variable was modified. The learner doesn't need to be told what the intervention was; it can infer it.

I'm going to close with this slide, to tell you a little about where my group is going, looking forward. We want to train learning agents that build a world model which captures the causal structure of the world, but also the right space of abstract variables, variables on which one can reason, and which provide a good representation for accounting for the changes in distribution that happen in the real world because other agents do things, making much better out-of-distribution generalization possible. I also mentioned how this can be important for exploratory behavior: if I have some self-knowledge of what I know and don't know about the causal structure, I can use that to decide where to explore in order to acquire more knowledge about how the world works. Okay, thank you very much.