You know, Yoshua is the head of the Montreal Institute for Learning Algorithms at the University of Montreal. He truly needs no introduction, so I think I will step down at this point. Thank you, Yoshua.

Hello. So I'm going to tell you about some fairly high-level thoughts regarding approaching human-level AI, and in particular to focus on the question of language understanding, which is one of the key ingredients for building intelligent machines. As Josh was saying this morning, I think we've underestimated how hard that's going to be, and maybe we haven't taken the right direction. A lot of natural language research these days is done with machine learning and large text corpora, sometimes huge corpora; for example, for machine translation, something I've been working on. And what I'm going to tell you is that even though we can make some incremental progress by continuing this way, it's not going to be enough. It's not going to be enough to build machines which actually understand language. So it's a fairly simple message; hopefully you'll get it. I remember seeing some talks a long time ago saying that if we were able to train a really good language model, then it must mean that the model is capturing the underlying meaning, so it sounds like we can just train on text corpora, because to predict the next word correctly we need to understand the rest of the sentence. Well, that's nice in theory, but unfortunately when you train language models (and by the way, this is also true of other NLP tasks), what happens is that the models manage to get their objective function to a pretty low value, in fact, in the case of language models, pretty close to estimates of human-level perplexity. And yet they don't seem to capture the high-level understanding of the world which is necessary to really do a good job.
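To make that objective concrete, here is a minimal sketch (my own illustration, not something from the talk) of the quantity involved: a language model is trained to predict each next word, and perplexity is just the exponential of its average per-token negative log-likelihood.

```python
import math

def perplexity(neg_log_likelihoods):
    """Perplexity = exp of the mean per-token negative log-likelihood (in nats)."""
    return math.exp(sum(neg_log_likelihoods) / len(neg_log_likelihoods))

# Made-up per-token losses from a next-word prediction model: lower perplexity
# means the model assigns higher probability to the words that actually occur.
nlls = [2.1, 0.4, 3.0, 1.2]
print(round(perplexity(nlls), 2))  # ~5.34
```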
So that's something that got me thinking. Of course, one way to see what goes wrong in many of these systems, whether they are language or image related, is to look at the mistakes they make, and often the mistakes show us how limited the current systems are. A phrase that's been used by many people to talk about what's missing is common sense, which of course can mean different things to different people. But essentially, in the case of natural language, there are sentences like the Winograd schemas, which humans can correctly interpret. For example: "The women stopped taking the pills because they were pregnant." The question is whether "they" refers to the women or the pills. Here it's pretty obvious, and if I were to change "pregnant" to "carcinogenic", the answer would change. Current state-of-the-art machine learning systems do barely better than chance on these kinds of questions, so they don't have an understanding of the world around us. And again, referring to a video that Josh used: it looks like our current systems are lacking the things that a two- or three-year-old gets very easily about the world, whether it's intuitive physics or intuitive psychology. So what are these things? They're not things that anyone has been able to formalize. Parents don't need to teach their children about them; children figure them out by themselves somehow. Maybe some of it is innate and some of it is learned; we don't know exactly where the boundary is. But children acquire those skills at a level which is pre-linguistic, right?

So all of us have a lot of intuitive knowledge about the world, which is part of that common sense, and which we find very difficult to communicate through language; I'll come back to this. If we're asking how we could build machines that have the same level of understanding of language as humans, we need to try to zoom in on what it means to understand a question or a document. To me, the most basic part of the answer is knowledge. In order to really make sense of these sentences and questions and so on, the computer needs knowledge, and the question that machine learning, and AI in general, has been trying to answer for many decades is: how do we get that knowledge into the computer?

To help us see some of the limitations of our current approaches based on large text corpora, let me go through a simple thought experiment with you. Let's assume you're traveling in space and you arrive at a planet, and you're trying to figure out the language of the aliens, and you're able to observe the bits of information that the aliens are exchanging with each other. That's their language, right? So you could do language modeling by observing those streams of bits. Unfortunately, there's a slight difference between the way they communicate on that planet and the way we do on Earth. The difference is that the aliens are able to communicate through a noise-free channel, which is not the case for us: we have speech, which is very noisy, and so on. Both the aliens and humans also pay a cost for the use of bandwidth, so they're going to try to compress their messages as much as possible. However, in their case, because they have a noise-free channel, they can fully compress the signal. And so if you just observe the bits that are being sent, they would just look like random bits. In other words, just observing the text is not going to give us anything about the meaning. This is really important.
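Here's a small illustration of that information-theoretic point (my own sketch, using only the Python standard library): once a message is well compressed, its byte statistics become nearly uniform, so the stream looks like noise even though the meaning is still in there.

```python
import math
import zlib
from collections import Counter

def bits_per_byte(data: bytes) -> float:
    """Empirical entropy of the byte distribution; 8.0 means 'looks like random bits'."""
    n = len(data)
    counts = Counter(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

raw = " ".join(str(i) for i in range(20000)).encode()  # structured "text"
compressed = zlib.compress(raw, 9)
print(bits_per_byte(raw))         # low: the raw stream has visible structure
print(bits_per_byte(compressed))  # close to 8: the compressed stream looks random
```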
Of course, you're going to say, well, for us it's different. My hypothesis is: yes, we're getting some information by just modeling text, and in fact you can see some semantic information in word vectors and things like that, but maybe we're only getting part of it, and even if we saw an infinite amount of text, we would never get to the level of understanding of those texts which we have. So how could we understand this alien language? What's the solution? Well, we need to do a bit more work. It's not going to be enough to look at the bits they're exchanging; we need to try to understand their intentions and their context. We need to model what they're doing and try to figure out the causes of their communications and actions. Of course, this is much harder, but I think this is the problem we have in AI right now. We're lazy, we're greedy, and we're trying to build something that will solve the AI problem within the next six months, before the next conference deadline. That's just not going to work. We have to invest in solving these hard problems, which could take decades or centuries. In the case of the alien-world problem, well, it's hard: we have to understand the alien society, and we have to bite the bullet. For AI, what this means is that if we want to do natural language understanding, we have to model the world.

And that includes vision, understanding social interactions, and many, many other things that currently, if you talk to natural language processing people, are not really part of what they're explicitly trying to do. So it's pretty ambitious, and it might take a lot of time before we solve these problems. One interesting question is: should we first solve the understanding-the-world problem, forget about natural language for the next 30 years, and once we've dealt with that, tack the language part on top? Or should we jointly try to learn about the world and about language? My inclination is that we should do both together; some people disagree with me, and that's fine. My motivation is that we can get some clues about how the world works by looking at what humans say in some context. I think there's some evidence for this when we compare supervised and unsupervised deep learning: the high-level features learned by supervised learning, say on ImageNet, are actually much better at capturing high-level semantic information than those we are currently able to learn with unsupervised learning methods of all the kinds that we know. I think one basic reason is that when we train those systems with the word labels, we're already giving them high-level semantic information about the concepts that matter to explain things in the world, so we're injecting that extra knowledge. So that's one reason.
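As a concrete picture of that contrast (a sketch of my own, assuming PyTorch and a recent torchvision, not something shown in the talk): features from the penultimate layer of an ImageNet-supervised network are commonly reused precisely because they carry high-level semantics.

```python
import torch
from torchvision import models

# Load a ResNet-18 trained with ImageNet word labels and drop the classifier,
# keeping the penultimate features that tend to carry high-level semantics.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.eval()
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-1])

x = torch.randn(1, 3, 224, 224)  # a dummy image batch
with torch.no_grad():
    features = feature_extractor(x).flatten(1)
print(features.shape)  # torch.Size([1, 512])
```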
Another reason comes from thinking about cultural evolution, something I thought about a few years ago. We can think of how language and culture have evolved as a big optimization problem, where it's not just an individual brain trying to figure out how the world works, but a whole community, a whole group of humans across generations, trying to decipher how the world works and using language and culture to help each other. In this context, language could be a crucial tool for machines as well. In other words, in the same way that a single human trying to figure out how the world works without the help of any other human might face a big challenge and might stay fairly dumb for the rest of their life, maybe we will need humans to teach machines and to provide them with clues about the world, just like in Hal's story.

Alright, so the system one versus system two distinction was mentioned earlier this morning, and I think it's a very useful one for thinking about these questions. Kahneman and others have tried to separate different kinds of cognitive tasks into system one and system two tasks. The system one tasks are those that you can do very quickly, in something like half a second, such as object recognition. They're intuitive, they're fast, they're often heuristic, so they might be imperfect, but they get the job done quickly, and usually they're not linguistic: it's hard for us to explain why this is not a phone even though it might look a little bit like one. And actually, this touches on an interesting aspect, which is that there's a lot of knowledge about the world encapsulated in our system one computation to which we don't have conscious access. That means this knowledge is hardly represented explicitly in language, so we could collect as much text as we want about people exchanging information and still be missing the part of the knowledge in our brains that is represented in the system one aspects of our mental computation, because we don't need to communicate about it. All of us know about intuitive physics and intuitive psychology without consciously knowing about them and without being able to verbalize them. It would be very difficult for us to provide machines with that kind of knowledge, because even though we have it in our heads, we don't know how to express it. This is why I think classical expert systems failed: in addition to the lack of modeling of uncertainty, the inability to formalize all the kinds of knowledge that live in system one computation is a big issue. System two is everything else: the things we do that are slow, sequential, logical, conscious, that we can talk about, that are linguistic, things like designing algorithms. These are the things we're good at in computer science, the things we're good at with logic, and the things that classical, symbolic AI was trying to deal with.

So I think we obviously need to solve both problems, and I think that grounded language learning is a direction of research which would allow us to really get to systems that have both system one and system two capabilities: grounded in some environment, in observations of and interactions with that environment. That bottom-up part is system one, and we want to associate it with meaning and language. For that purpose there's a whole direction of research in machine learning, and especially in deep learning, that typically comes under the banner of deep reinforcement learning, where people design learning frameworks for agents and then test those frameworks in virtual environments. These are agents; they're not passively observing, and I think this is a crucial ingredient that past work in deep learning hasn't exploited enough. There was some discussion earlier about causality, intervention, counterfactuals, and so on, and this aspect of understanding the causal structure, at least to some extent, by being able to intervene and see the effects of my actions, is something the deep learning community is starting to pay attention to, but much more needs to be done. Now, there is a common criticism of that type of research: "you're doing all this in virtual environments, and it's not realistic; the real world is much more complicated." My answer is that we're very far from human-level AI. What we're really after here is not to actually put the knowledge I was talking about into the computer; that is of course the ultimate goal, but the short-term goal is to design learning mechanisms, learning procedures, learning frameworks. And learning frameworks are fairly general, or at least we try to make them as general as possible, which means that if we have something that cannot even learn in a fairly simplified environment, like these 3D worlds, it's very unlikely to work in the real world. We have to figure out how to walk before we can figure out how to run.
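To make the agent-environment setting concrete, here is a minimal sketch of such an interaction loop (my own illustration, assuming the classic Gym API; the environment is just a stand-in for one of these simplified virtual worlds):

```python
import gym

env = gym.make("CartPole-v1")  # stand-in for a simplified virtual environment
obs = env.reset()
for step in range(1000):
    action = env.action_space.sample()          # a learned policy would go here
    obs, reward, done, info = env.step(action)  # intervene, then observe the effect
    if done:                                    # episode over: reset and keep going
        obs = env.reset()
env.close()
```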
There's also interesting research called Sim2Real, where people train these neural net models in 3D virtual environments and then use domain adaptation strategies to transport that learning to real environments, where very little data is needed to do the conversion.

Okay, so let's go back to the causal aspect of things. I think what's going on right now in many deep learning systems is that they're looking for simple clues in the data that allow the learner to get the right answer on the training data, and as soon as you test them on something sufficiently different, they tend to break down. In fact, we have some papers where we try to analyze the kinds of features they have learned and what they're sensitive to, and often we find that they're not sensitive to the things we think they should be sensitive to: instead of capturing objectness in images, for example, they're capturing all kinds of low-level clues that have to do with texture and the frequency of different patterns and things like that. Humans are very different. Humans actually spend a lot of mental energy trying to figure out the causes of and explanations for things, and this is clearly lacking in our current systems. There's a tool, though, on which we've been making a lot of progress in my community, which is deep generative models, and I think these are going to be very, very useful as we move towards building more causally motivated architectures, because part of what a causal agent needs to do is simulate the future in some way. Again, Josh was talking about this, and I think it's really important: we have an internal mental simulator. We've made a lot of progress in training neural nets, for example with GANs and others, to sample from complicated distributions in a fairly accurate way. That alone is not enough, but it's going to be really important as we build agents that are more like model-based reinforcement learning, in which the agent is both learning a policy and learning how to project itself into the future in order to take decisions; that's what planning is about.

This causality discussion also brings me to the IID assumption we make in machine learning: we assume that the test data comes from the same distribution as the training data, and even our theory relies on this. We're currently lacking a theory to explain how humans can generalize very far from the training data. For example, you can read a science fiction novel that talks about a situation that has never happened and will never happen, and yet you can figure out what the sequel would be after you've read half the book. So what I'm proposing, from the theory point of view, is that we spend some time exploring other structures for our learning theory, in which instead of assuming that the test situations come from the same distribution as the training situations, we only assume that they share the same causal mechanisms. So what does that mean? You can think of the causal mechanisms as a set of gears which, starting from some initial conditions, give rise to the states of the world that we can observe. If we have the same causal mechanisms and the same initial conditions, we get the same distribution out. But we could assume only that we have the same mechanisms, with different initial conditions, and then we get something that may be very different in appearance: if I'm on the moon, it looks very different from Earth, but it's really the same laws of physics.
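Here's a toy sketch of that distinction (my own illustration): one shared mechanism, run from different conditions, produces data that look very different, yet a learner that has captured the mechanism itself would generalize across both settings.

```python
def fall_step(height, velocity, gravity, dt=0.1):
    """One shared causal mechanism: an Euler step for an object in free fall."""
    velocity -= gravity * dt
    height = max(0.0, height + velocity * dt)
    return height, velocity

# Same law, different conditions: the observed trajectories (the "distributions")
# differ a lot, but the mechanism that generates them is identical.
for name, gravity in [("Earth", 9.8), ("Moon", 1.6)]:
    height, velocity = 10.0, 0.0
    for _ in range(10):  # simulate one second
        height, velocity = fall_step(height, velocity, gravity)
    print(name, round(height, 2))  # the object falls much farther on Earth
```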
And humans are able to do that. When you read a science fiction novel, there are some assumptions, often made explicit at the beginning of the novel, and from that point on it's all logical. Good science fiction novels, anyway.

So let me tell you a few words about a project we've started in my group which tries to go a little bit in the direction I'm talking about; I think it's just one path, and we need a lot more people to explore many more paths. I call this the Baby AI game project, and the goal is to build a game which eventually real humans will play (we're not ready for that yet). In the game, the human plays the role of a teacher, a professor, for a virtual agent which we call the baby AI, or the baby AI learner. The baby AI learner lives in some environment, like a video game, and the learner and the human player interact in natural language. Initially the baby AI doesn't know much; it looks like it doesn't know anything, but it has a little bit of prior knowledge that allows it to initiate an interaction with the player. The player, on the other hand, knows a lot. The player is a human; the player can play the game, and the player even has knowledge about pedagogy: we're used to teaching others, we know this intuitively, and sometimes we take courses to do it better. So the game is really about how the human player figures out the best way to teach that baby learner, for example by designing an appropriate curriculum adapted to the behavior of the baby. This game would also be interesting from a scientific point of view for a number of reasons. One of them is to collect data about human-machine interactions with a human in the loop, especially natural language data of this kind. Furthermore, it's not static data, because the game would be played by many people, so you could design experiments: you could send your experiment to the game environment, where the experiment consists of some learning procedure for the baby and maybe some novel levels for the game, collect data about how things go, and thus learn something from the scientific point of view about how to design better learners. It could also be used as a benchmark to compare different agent learning mechanisms. The biggest challenge from a machine learning point of view here is sample complexity: current reinforcement learning methods demand a lot of data before they can learn even very simple things. We actually submitted a paper on this project to ICLR, and we ran some benchmark experiments where you can easily need millions of interactions between the baby and the human to learn something very simple, like learning to fetch things and find things in some environment. We've designed a set of very simple levels (just 2D right now) and a template language that is combinatorial, so there's a huge number of potential queries and missions that we could ask the baby to solve.
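To give a flavor of what such a combinatorial template language can look like, here is a hedged sketch (my own mock-up, not the actual Baby AI grammar): a handful of templates and attributes compose into a large space of missions.

```python
import random

templates = [
    "go to the {color} {obj}",
    "pick up the {color} {obj}",
    "open the {color} door",
    "put the {color} {obj} next to the {color2} {obj2}",
]
colors = ["red", "green", "blue", "purple", "yellow", "grey"]
objects = ["ball", "box", "key"]

def sample_mission():
    """Sample one instruction; templates x attributes yields many distinct missions."""
    return random.choice(templates).format(
        color=random.choice(colors), obj=random.choice(objects),
        color2=random.choice(colors), obj2=random.choice(objects),
    )

print(sample_mission())  # e.g. "pick up the blue key"
```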
I don't have a lot of time left, but let me say a few words about another thing, related to this, that I think we need to change: the way we simulate the future. I talked about these generative models which can predict the next state of the world given the current state; there are lots of papers doing this kind of thing. The traditional machine learning approach to learning a model for model-based reinforcement learning, or in general for modeling sequences of things in the world, goes like language modeling: predict the next frame given the previous frames, or the next observation given the previous observations. That sounds reasonable, because to do that well you need to capture the joint distribution. However, if the goal is to build a model which will be used by an agent to plan and to simulate the future in ways that are useful for taking decisions, I think this is complete overkill, and the training objective is not putting pressure in the right places. If you introspect a little about how you plan, about what kinds of thoughts you have when you're projecting yourself into the future, you realize that you're not modeling in perfect detail all the pixels that are going to arrive at the next time step. That's not what's going on. First of all, you can project yourself to arbitrary points in the future: we're not modeling t, t+1, t+2 and so on, and we don't even have to specify whether it's t+20 or t+2000. I just know that later I have to catch a flight, roughly sometime this evening, and that next week I have an important meeting, though I don't remember when, but I can still plan with that in mind. So time isn't handled in this way by humans. Furthermore, when we think about the future, when we project ourselves, we don't represent the full state of the world, the whole of the details of what's going to happen. That's impossible: there are so many things we can't predict, the distribution would be way too complicated, and having just a few samples of that distribution would not characterize it in a way that's sufficiently useful. So how do we do it? I think the way we do it is that we focus on just the few relevant aspects of the future that matter to the plan we're thinking about.

This is connected to an idea that is really another research project, connected to the Baby AI game, which we've started in my group and which I call the consciousness prior. The idea is that we're going to learn representations, with neural nets of course, but we're going to distinguish two kinds of representation. We have the traditional representations that capture a lot of information about the input, and maybe the past: the unconscious state. But we're also going to learn, using an attention mechanism, to select just a few dimensions or projections from that high-dimensional unconscious state, a few variables that are going to be your thought at a particular moment. Think of it like a sentence in English, or the conditions of a rule in a rule-based system: there are just a few variables and their values, and we have an attention mechanism which does this selection. The reason I'm calling this a prior is that the mapping from the input to this unconscious representation is going to have to be very special, so that I can make these plans and statements about the future using only a few dimensions at a time.
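A minimal sketch of that selection step (my own illustration, assuming PyTorch; the real project would learn the attention scores rather than use random ones): pick the top-k dimensions of a high-dimensional "unconscious" state to form a low-dimensional "conscious" thought.

```python
import torch

torch.manual_seed(0)
h = torch.randn(512)          # high-dimensional "unconscious" representation
scores = torch.randn(512)     # stand-in for a learned attention network's output
k = 4                         # a thought involves only a few variables at a time

idx = scores.topk(k).indices  # attention selects which dimensions to attend to
thought = h[idx]              # the sparse, low-dimensional "conscious" state
print(idx.tolist(), [round(v, 2) for v in thought.tolist()])
```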
If I were to try to build these conscious thoughts that allow me to make true predictions using pixels directly, it would be very difficult: just pick three or four pixels and hope that I can predict one of the four given the three others with high probability. However, if I make the prediction in the right semantic space, for example if I say "I'm going to catch this object," the probability of that statement being true is very, very high. The reason this can happen is that I'm working at the right level of representation, where I can make these very strong predictions. The prior here is that there exist true statements about the world, sometimes predictive, sometimes explanatory, which can be made using just a few well-selected variables, and that prior imposes something on the way we want to represent high-level information. It's been a bit of my quest over the last decade: how do we discover good representations, how does a learner figure out how to disentangle the underlying causes, the underlying factors that explain what we're observing? The idea here is to take advantage of something the classical AI people figured out a long time ago: there's a lot of knowledge about the world that can be expressed with very simple rules involving only a few variables at a time. What I'm hoping is that by imposing that extra regularizer, if you want, we're going to help those learners figure out more useful representations, and hopefully bridge the gap between system one computation and system two computation, where we pick thoughts and use them to reason and plan.

Alright, so I'm going to conclude. There are lots of things that are needed to move closer to human-level understanding: of course things like cheaper, faster, and less energy-hungry computing, but also fundamental changes in the way we think about learning representations, learning to understand language, and finally addressing the question of causality in our machine learning methods. Also, something I didn't spend time on: if you think about learning agents in these very high-dimensional spaces, it's not enough to be passive. In order to discover the information the agent needs, it's probably going to have to explore, and not explore by a random walk, but explore in a smart way, like a child at play doing just the right things to find information about the world, or, as Josh was saying, like a scientist doing experiments in order to acquire information. It's not random experiments in the hope that something good comes out; there's planning that goes on in the act of actively acquiring information. So thank you very much. Questions?

Yes: "Is it fair to say that the consciousness prior is a sparse prior on a set of latent, conscious states?" Yes, except that the sparsity is dynamic, right? It's controlled by this controller that decides what we're thinking of at any particular time. So yes, it's totally about sparsity, but it's a dynamic sparsity: it's not always the same things that are going to be activated, it depends on context. But yeah, it's a good statement.

Next one: "Are there any results where your proposed relaxation of IID yielded formal results?" No. I'm hoping that people will tackle this problem; I don't have the answer to it, but I feel like it's a direction for extending learning theory, and we need to do this technically and formally.
The next one: "Can you discuss the need for integrating different sensory perceptions to ground language learning? How can it be done?" So I think, coming back to the question of knowledge, I don't think it's so important to have many sensory modalities. What matters is that the sensory modalities give a view of the environment that's sufficient for an agent to figure out how the environment works. As a counterexample, I think that when we do grounded language learning by trying to associate sentences with images, it's insufficient, because a static image doesn't give us enough information about the environment. Even if you train with lots of these images, it's going to be difficult for the learner to figure out, for example, the 3D nature of things simply by looking at them. You'd need at least for that agent to be in an environment, maybe with stereo vision or with sequences of images, so that it gets a chance, maybe actively, to figure out the objectness of things in 3D and so on. So it's not about the number of sensory perceptions, but about the sensory perceptions being rich enough to allow the learner to figure out the concepts that matter in that environment.

The next one: "Do you see grounded..." (something disappeared) "...genetic algorithms also focus on causal effects through evolutionary" something. "What are the pros and cons of Baby AI compared to genetic algorithms?" I think these are just addressing different questions. Genetic algorithms are about optimization, and here I'm thinking about a framework in which we can evaluate different agent learning mechanisms and grounded language learning.

The next one: "Have you made connections to researchers in symbolic AI who have been working on these problems for decades?" Not recently, but you know, I'm old enough that when I took AI classes it was all about classical AI and symbolic AI, so I have some knowledge of that. But you're right, I should reach out more to these people. I've started to reach out more to people on the side of cognitive science and neuroscience, and in child development; I think these people have a lot to teach me that's relevant here.

And the next one: "Vocabulary is often a measure of intelligence in humans" (I don't like that) "how does an increase in vocabulary affect learning models?" Yeah, I don't know how to answer this, but I can tell you about an experiment we ran a long time ago, where we were able to speed up the training of a language model with a curriculum: we started with a small vocabulary of the most frequent words and gradually increased the size of the vocabulary. I think the thing with vocabulary, for me, is not how big it is, but what it represents from the point of view of the learner about the aspects of the world that are being understood by the learner, by the agent. For me, a small vocabulary presumably means there are few concepts in the world that I understand and can talk about. As children understand more and more things, they're able to put words on those things. That's how I would put it.
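A toy sketch of that kind of curriculum (my own reconstruction, not the original experiment's code): rank words by frequency, and let early training stages see only the most frequent ones.

```python
from collections import Counter

corpus = "the cat sat on the mat and the dog sat on the rug".split()
ranked = [w for w, _ in Counter(corpus).most_common()]  # most frequent first

# Grow the vocabulary stage by stage; out-of-vocabulary words become <unk>.
for stage, vocab_size in enumerate([2, 5, len(ranked)], start=1):
    vocab = set(ranked[:vocab_size])
    stage_corpus = [w if w in vocab else "<unk>" for w in corpus]
    print(f"stage {stage} (vocab={vocab_size}):", " ".join(stage_corpus))
```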
And the next one: "A substantial increase in computing power is needed" (I agree) "what hardware is best in your mind?" Well, there are short-term and long-term answers here. In the short term, lots of people are doing what I think is the right thing to do: digital circuits designed for the kinds of calculations that are currently done in deep learning. I'm pretty sure we'll get very significant speed-ups in the next few years using these approaches; companies are already putting out some chips. But in the long run, I think we may need to really explore very different kinds of devices. I think we need to go analog to some degree, but this requires longer-term investment, and we should have some research going in that direction as well.