Happy birthday. I'm kind of envious, because I turned 60 in July 2020 and there was no way to organize anything. So, a lot of things are happening in AI today, and I'll try to talk about some interesting avenues for making AI even more impressive than it has been over the last few months, and then point out the limitations of what's happening.

So, machine learning sucks. It really sucks, at least if we compare it to what humans and animals can do. Our ability to learn things really quickly, to figure out how the world works, mostly by observation when we're babies, is amazing, and we cannot reproduce this with machines today. Despite all the hype you hear, we don't know how to do it. Supervised learning has been widely successful for a lot of applications. Reinforcement learning has had more limited success, mostly in games and things like that, because it requires insane amounts of trials. And what has taken over the world over the last few years is something called self-supervised learning, which I'll say a few words about. But in the end, we still have systems that are specialized and somewhat brittle. They make stupid mistakes. They do not reason and plan, or at least very few of them do. Compare this to humans and most animals: they can learn new tasks extremely quickly, understand how the world works, reason and plan, and have some level of common sense. We still don't have machines that can do this.

One limitation of most current AI systems is that they have a constant number of computational steps between input and output. That includes the auto-regressive large language models that a lot of people have heard about in the last few months. There is a fixed amount of computation to produce every token, and that limits the reasoning ability of those systems. They can't really plan either: they're auto-regressive, so they produce tokens one after another. So how do we get machines to learn and act more like humans and animals, in particular to reason and plan?

Let's talk about self-supervised learning first, because it's really what created the last revolution in AI, and it's something I've been advocating for the last seven or eight years; I've been a big proponent of it. Self-supervised learning is the idea that you capture the internal dependencies within a signal by training a machine to predict. If you were to train a machine to predict video, you would show it a video clip, then reveal the next segment of the video, and train the system to predict what's going to happen next. The masking doesn't need to be about the future: it could be about the past, or about different parts of the input. Essentially, you take an input, mask a piece of it, and then, from the piece that is visible, try to capture its dependency on the piece that is not visible, or not currently visible. And that works astonishingly well for things like natural language understanding. Every top NLP system over the last five years or so has been trained, or pre-trained, the following way: you take a text, you corrupt it by hiding some of the words, 10 to 15% of them, replacing them with a blank marker or substituting another word, and then you train some gigantic neural net, generally a transformer architecture, to predict the words that are missing.
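(For concreteness, here is a minimal sketch of that masked-word pretraining objective in PyTorch. The 15% masking rate follows the description above; the toy vocabulary size, model dimensions, and the use of token id 0 as the mask marker are illustrative assumptions, not the actual recipe of any particular system.)

```python
# Sketch of masked-word pretraining: hide ~15% of the tokens, run the corrupted
# text through a transformer encoder, and predict the missing words.
import torch
import torch.nn as nn

vocab_size, d_model, mask_id = 30000, 512, 0   # illustrative sizes; 0 = mask marker

embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=6)
to_vocab = nn.Linear(d_model, vocab_size)

def masked_lm_loss(tokens):                     # tokens: (batch, seq_len) word ids
    mask = torch.rand(tokens.shape) < 0.15      # hide 10-15% of the words
    corrupted = tokens.masked_fill(mask, mask_id)
    hidden = encoder(embed(corrupted))          # contextual representations
    logits = to_vocab(hidden)                   # distribution over the vocabulary
    # the loss is only computed at the masked positions
    return nn.functional.cross_entropy(logits[mask], tokens[mask])
```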
In the process of this pretraining, the system learns internal representations of text that capture a lot about the syntax, the semantics, the meaning, the style, everything. You can furthermore train those systems to be multilingual: you don't have to train them on a single language, you can pre-train them on multiple languages, and those systems find some sort of internal representation that is language-independent, which is kind of baffling. But it works amazingly well. And as I said, this is not a new phenomenon. These things have been used in production, very widely deployed, over the last four or five years. This is what has allowed companies like Meta (Facebook), YouTube and others to do content moderation much more efficiently, detecting things like hate speech. The proportion of hate speech that was automatically detected used to be on the order of 30%, about five years ago. Now it's 95%, and it's just because of this. Translation systems work really well now; it's because of this. So an incredible revolution.

Those systems have also been used for generating content, whether text, images, videos, et cetera. And for this, it's a special case of what I described, in which the masking you perform is not of random words in the text: you just mask the last one. So you train a gigantic neural net to predict the last word in a long sequence of a few thousand words taken from a corpus. You train this system on, I don't know, one to two trillion words, with neural nets that have transformer architectures, with a particular pattern of connections inside that makes them causal, so the neural nets can only pay attention to stuff that is in the past of whatever it is they are predicting. And they may have on the order of billions to a trillion parameters. Once they're trained, you use them by producing the next word in a text: you feed them a prompt, you ask them to produce the next word, then you inject that word into the input by shifting everything by one, produce the next word, shift again, et cetera. So that's just autoregressive prediction, an old concept, of course. And the amazing thing is that when you make those systems big enough, there's some sort of emergent property that appears. They seem not only to understand, to some extent, the text that they're reading, but they can produce text that kind of makes sense, particularly if you fine-tune them for a particular task, like answering questions, through human feedback.

There's a long history of language models of this type that predict the next word, going back to Shannon, so it's not a new idea. The first neural models to do next-word prediction were by Yoshua Bengio in the early 2000s. What happened in the last few years is just scaling them up, essentially, and having access to more data.
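(A minimal sketch of the autoregressive decoding loop just described: predict the next token, inject it into the input, shift, and repeat. The `model` interface, context size, and sampling choice are assumptions for illustration; any causal language model returning next-token logits would fit.)

```python
# Sketch of autoregressive generation with a causal language model.
import torch

def generate(model, prompt_ids, n_new_tokens, context_size=2048):
    tokens = list(prompt_ids)
    for _ in range(n_new_tokens):
        window = torch.tensor(tokens[-context_size:]).unsqueeze(0)   # last N tokens
        logits = model(window)[0, -1]             # logits for the next token only
        probs = torch.softmax(logits, dim=-1)     # distribution over the vocabulary
        next_token = torch.multinomial(probs, 1).item()   # sample (or argmax for greedy)
        tokens.append(next_token)                 # inject the prediction and repeat
    return tokens
```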
So there's a series of dialogue systems that have been released by various companies and labs. BlenderBot, that was a couple of years ago. Galactica came out late last year; it was trained on the entire scientific literature for the purpose of helping scientists write papers. It was released as a demo, and it was murdered by Twitter. A lot of people on Twitter said, oh, this is horrible, people are going to use this to generate fake scientific papers, this is going to flood the peer review system, and society will be destroyed. And as a result, the people who created Galactica at FAIR, a small team of five people, got so distraught that they took it down. And then the leadership at Meta said, this is too dangerous, we're not going to release anything like this again. So the reaction of the public, you know, can have a very damaging effect: under the pretense of ethics, it actually damages the progress of science. Anyway, we have to be careful about this.

Then the next thing that was released by FAIR, very recently, is a system called LLaMA, which is open source. This is a large language model, the same kind of autoregressive model, in several sizes from 7 billion to 65 billion parameters. The 13-billion-parameter version gives better results on benchmarks than the 175-billion-parameter GPT-3, so progress has been made. The inference code is open source, but the models themselves are behind a firewall: you have to apply to get the weights of the network, and when you get them, you cannot use them commercially. And the reason for this is that those systems have been trained with lots and lots of data from everywhere on the internet, and a lot of the people who provide this data are not happy that it is used to train language models. So if FAIR or Meta were to distribute this commercially, it would probably get sued by a whole bunch of people, like Reddit and Twitter and whoever. So no truly open-source AI industry, because of legal issues. Again, people talk about ethics, but that's a big ethics question. Alpaca was a system from Stanford that was basically a fine-tuned version of LLaMA. And then there are similar systems at Google, at DeepMind, et cetera; huge teams working on this in all of those companies. And of course, everybody knows about ChatGPT, for the simple reason that it works really well. It's been fine-tuned for a year or two, and it's available to the public. But in terms of underlying innovation and technique, there's not much: it's just well engineered, essentially. I said this on Twitter and was accused of being jealous or something.

So the performance of those things is amazing. They're very useful, particularly as writing aids, but they make really stupid mistakes. They make factual errors, logical errors. They're really inconsistent, particularly over long utterances. They have very limited reasoning. There's no way to control for toxicity, et cetera. And they don't have any knowledge of the underlying reality. They're purely trained on text, and, surprising as it may be to many of us, most of human knowledge has nothing to do with language. It's knowledge of the physical world, intuition, even for mathematicians. If I ask a question: I multiply a vector by a positive semi-definite matrix; can the resulting vector form an angle larger than 90 degrees with the original vector? Most of you here, I'm sure, have some mental model of what a positive semi-definite matrix does to a vector, and realize that it only stretches things along some axes and can't possibly rotate a vector by more than 90 degrees. Or maybe you remember a theorem that says that the corresponding quadratic form is non-negative. Either way, you have a mental model that you use, an intuition. Those systems have no mental model, or whatever mental model they have is purely built from text and very shallow in its understanding of the world. No intuition.
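(The mental model in that example can be written down in one line. A sketch of the argument, assuming Ax is nonzero:)

```latex
\[
  A \succeq 0 \;\Rightarrow\; x^\top A x \ge 0
  \;\Rightarrow\;
  \cos\theta \;=\; \frac{x^\top (A x)}{\|x\|\,\|A x\|} \;\ge\; 0
  \;\Rightarrow\; \theta \le 90^\circ .
\]
```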
Even if you don't remember the theorem, you have the intuition.

So it's very useful to use those things to generate text, particularly text that is very organized, like code. This is going to revolutionize the way software is written. Here is some code generated with the LLaMA 65-billion model, this open-source thing. You just specify: find the real roots of blah, blah, blah, and the system just writes the code; or: write a regular expression to remove HTML tags from a Python string, et cetera. So short code that works really well. Entire software systems, no, because those systems can't plan; they can't really organize data structures and things like that, but they'll write code for one page or so. But they hallucinate. So my colleagues prompted one of them with something like: did you know Yann LeCun dropped a rap album last year? We listened to it and here is what we thought. And the system just continues and makes up a whole story about me putting out a rap album. I actually don't particularly like rap; I'm more of a jazz person. So I asked them to do the same thing with jazz, and they said no, it doesn't work, because there aren't enough jazz reviews online. So I cried.

Okay, so what are they good for? They're good as a writing aid, certainly: writing assistance, first-draft generation, stylistic polishing, which is really good for the many of us who are not native English speakers. They're not good at producing factual and consistent answers; they hallucinate. They're not good at taking into account recent information; they're trained on data that's two years old, essentially. Or at behaving properly: they just reflect the statistics of the data, and that really depends on what they've been trained on. They don't do reasoning, they don't do planning, they don't do math. They could be using tools, such as search engines, calculators, database queries, et cetera; people are working on this actively, but ChatGPT doesn't do this. It's a very active topic of research. We're easily fooled by their fluency into thinking that they are smart, but they're not. And they don't know how the world works.

And here is a little bit of folk math about it. Imagine that the sequences of tokens such a model can produce are organized in a tree: for the first token there are, say, 100,000 branches, one for each word in the dictionary, and for each of those, all the possible next words, et cetera. The set of correct answers, however you define it, is a sub-tree within that tree. Now imagine, for the sake of simplicity, that there is some probability e, for every token that is generated, of taking you out of the tree of correct answers. Assuming the errors are independent, which of course is false, the probability that a sequence of n tokens stays within the set of correct answers is (1 - e) to the power n. It's a diffusion process with exponential divergence, which means there's no way in hell those things can work well. Like, no way. The probability of staying correct decays exponentially. The only thing you can play with at the moment, which is what a lot of people are essentially boiling small lakes to do, is to make this e smaller. But you can't fix the fact that it's an exponentially diverging diffusion process. So this is not fixable without a major redesign, and that's what I'm going to talk about.
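(In symbols, that back-of-the-envelope argument reads as follows, writing the per-token probability of leaving the correct sub-tree as epsilon and keeping the acknowledged simplification that errors are independent:)

```latex
\[
  P(\text{a length-}n\text{ answer stays in the tree of correct answers})
  \;=\; (1-\varepsilon)^{\,n}
  \;=\; e^{\,n \ln(1-\varepsilon)}
  \;\approx\; e^{-\varepsilon n},
\]
% which decays exponentially with the length of the answer: reducing epsilon
% changes the rate but not the exponential behavior.
```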
Okay, and that's not the only problem those things have. They have a constant number of computational steps between input and output for each token, so they have somewhat weak representational power. They don't really reason and they do not plan; I made that point before. And they're missing a lot of characteristics of human and animal intelligence. So they suck. I mean, they're very useful, they're going to create a new industry, they're going to revolutionize the world, but they suck.

So how do humans and animals learn so quickly? We learn a lot about how the world works in the first few months of life, mostly by observation. Later we learn through interaction, but at first it's mostly observation. We learn really basic concepts: the fact that the world is three-dimensional, the fact that when an object disappears behind another one it still exists, the fact that there are categories of objects in the world; even if we don't know their names, we know there are distinct categories, spontaneously. And then around the age of nine months we learn about things like gravity, that objects are supposed to fall if they're not supported, inertia, intuitive physics. That takes a while. Put an eight-month-old in a high chair with some toys on the table in front of them, and they will systematically push them onto the floor and watch, right? Because they're running the experiment: checking that gravity actually works.

But then, how is it that babies can learn how the world works like this? How is it that, by the age of 16 or 17, any teenager can learn to drive a car with 20 hours of practice or so, and we still don't have self-driving cars? We may have ChatGPT or GPT-4 or whatever, but we don't have robots that can clear the dinner table and fill the dishwasher, even though any 10-year-old is capable of it. So Moravec's paradox, the observation that computers can do things that seem complicated to humans but can't do the simple things humans take for granted, is still with us. Perhaps the accumulation of background knowledge that babies acquire when they watch the spectacle of the world is what constitutes the basis of common sense.

So I see three challenges for AI and machine learning research going forward. The first is learning representations and predictive models of the world, using some form of self-supervised learning. The second is learning to reason in ways that are compatible with neural nets, essentially. The third is learning to plan complex action sequences, because that is part of the essence of intelligence. So I've made a proposal. I wrote a long paper, quite readable for wide audiences, not very technical, that I put on OpenReview so people can make comments and tell me I'm wrong. It's called A Path Towards Autonomous Machine Intelligence. I've given various technical talks about it, a bit longer than this one, and here is the story. The paper has been online since before the summer, so this predates ChatGPT and everything. It's based on the idea that an intelligent system should have some sort of cognitive architecture, some organization, and what I'm proposing is basically built around this idea of a world model.
So a world model is the mental model that we have of some reality we're dealing with, which allows us to predict how the world is going to evolve, in particular how the state of the world is going to change as a consequence of actions we might take. Because if we have such a model, it allows us to plan a sequence of actions to arrive at a particular outcome. The entire purpose of the system is to minimize some internal cost, and when I say minimize, I don't mean minimize by learning, I mean minimize by acting. So the system figures out a sequence of actions that, according to its internal predictive world model, will arrive at a state where its internal cost is minimized. And once it has planned the sequence of actions, it outputs the first action or group of actions into the world, gets an estimate of the state of the world back, and then repeats the process. So this is planning, very similar to classical planning in optimal control.

There are two ways to use an architecture of this type. The first one is reactive, similar to what Daniel Kahneman, the famous psychologist, called System 1, which is sort of subconscious action if you want: you perceive the world, extract some internal representation of the state of the world through a perception system, and then directly run this through some neural net that produces an action. This is just reacting, essentially. Autoregressive LLMs are of this type: the world, to them, is a window of previous words that have been fed to them or that they have produced, and they just produce the next word. It's direct.

But here is Mode 2. Mode 2 is considerably more sophisticated, and this is really what humans and many animals do. You perceive the world, run this through an encoder that gives you some representation of the estimated state of the world, whatever is perceived at the moment, and then you run this through the world model, which is represented here by this predictor. The role of the predictor is: from the state of the world at time t and an action you might take, what would be the state of the world at time t+1? So you can hypothesize a sequence of actions that you imagine in your head, you predict the result, and that goes into some cost function that measures to what extent you've satisfied the task you want to accomplish. This is very classical model-predictive control from optimal control, except that here we're going to learn the model, the cost function may be complicated, and the optimization problem of finding the sequence of actions that minimizes the cost may be highly non-convex, so we're going to have all kinds of problems. I'm not specifying what method we use to do this inference at this point; you can use whatever you think is appropriate. There are several models we've proposed along these lines, mostly for robotics control, not in the context of NLP or anything like that. But those systems can plan. They plan ahead: they have an objective they have to satisfy, and they plan a sequence of actions, or a sequence of words if it's a dialogue system, to satisfy that objective. This is not autoregressive. So my prediction, which may be wrong, is that within five years, absolutely no one in their right mind will be using autoregressive LLMs. They will probably use something like this, because you can correct for hallucination, you can correct for toxicity, you can correct for all kinds of things by designing those cost functions in the appropriate ways. Okay.
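(A minimal sketch of that Mode-2 planning loop, i.e. model-predictive control with a learned world model. The encoder, world model, and cost are placeholder differentiable modules, and gradient descent over the action sequence is just one of many possible inference procedures.)

```python
# Sketch of Mode-2 planning: optimize a sequence of actions so that, according to
# the learned world model, the predicted future states minimize the cost.
import torch

def plan(observation, encoder, world_model, cost,
         horizon=10, action_dim=4, n_steps=100, lr=0.1):
    state = encoder(observation).detach()              # estimated state of the world
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    optimizer = torch.optim.Adam([actions], lr=lr)
    for _ in range(n_steps):                           # inference = optimization over actions
        s, total_cost = state, 0.0
        for a in actions:                              # imagine the rollout "in your head"
            s = world_model(s, a)                      # predicted next state
            total_cost = total_cost + cost(s)          # how far from satisfying the task?
        optimizer.zero_grad()
        total_cost.backward()
        optimizer.step()
    return actions.detach()[0]                         # execute the first action, then replan
```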
Okay, so how do we build and train the world model? We're going to use self-supervised learning, but there is an issue. Self-supervised learning works really well for text, and the reason it works well for text is that although you can never predict exactly which word appears at a particular place in a text if you don't see that word, you can easily produce a probability distribution over all the words in your dictionary and basically manage the uncertainty in the prediction this way. So if I say "the cat chases the blank in the kitchen", the blank could be a mouse, but it's not necessarily a mouse; it could be, I don't know, a laser pointer dot or something. The system can produce a probability distribution over words and get away with dealing with uncertainty this way. If you want to do video prediction, we don't have a way to properly represent distributions over all video frames, and certainly not over all video clips. So we're going to have to cut some corners to deal with uncertainty in prediction in continuous spaces. This is the main reason why we don't have, at the moment, self-supervised learning systems that are trained on video and can learn how the world works from video: we don't know how to deal with that problem, or at least we only have ideas, and it's work in progress.

So, this is a system that was trained to predict the trajectories of cars from a top-down video of a highway, and if you train a neural net to make this kind of prediction, you get these blurry predictions; same here. It's blurry because if you ask the system to make a single prediction, the only thing it can do is predict the average of all the possible outcomes, and that's not a good prediction. So you have to find some way of representing uncertainty.

And my solution to this is to abandon the whole idea of generative models. A generative model works like this: let's say you want to capture the dependencies between X and Y, X being, for example, the initial segment of a video and Y being the continuation of it. You run X through an encoder, run the representation through a predictor that reconstructs Y, and then measure the reconstruction error. That's a generative model. The problem with this is that there may be a huge number of details in Y that are completely irrelevant to any task you might imagine. In this room, the precise texture of the carpet is irrelevant; there's no way you could remember it anyway. Or the position of every hair on everybody's head: it's irrelevant to any task you'll ever care about. So you don't want a generative model, because a generative model will have to actually model all those details if you have it reconstruct pixels; otherwise it's going to make a reconstruction error.

So what I'm proposing is something called a joint embedding architecture, and this is based on experimental results; that's the reason why we want to use it. You take X and Y, run both of them through encoders, and then you do the prediction in the space of the representations extracted by those encoders. There's a slight issue with this, which is that if you train a system like this as a whole, training the encoders and the predictor simultaneously to minimize the prediction error, it is going to collapse. It's going to ignore X and Y, set SX and SY to constants, and make the predictor just the identity function or some fixed function. Not even a function, really: it just needs to map a constant SX to a constant SY. So that doesn't work.
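(A sketch of the joint embedding predictive architecture as just described: encode X and Y, predict SY from SX, and measure the error in representation space rather than in pixel space. The encoders and predictor are placeholders, and as noted, training this loss naively is exactly what collapses.)

```python
# Sketch of a JEPA forward pass: the prediction error in representation space
# plays the role of the energy. Without a regularizer this objective collapses.
import torch
import torch.nn as nn

class JEPA(nn.Module):
    def __init__(self, encoder_x, encoder_y, predictor):
        super().__init__()
        self.encoder_x, self.encoder_y, self.predictor = encoder_x, encoder_y, predictor

    def forward(self, x, y):
        sx = self.encoder_x(x)                  # representation of the visible part
        sy = self.encoder_y(y)                  # representation of the part to predict
        sy_hat = self.predictor(sx)             # prediction in representation space
        return ((sy_hat - sy) ** 2).mean()      # prediction energy, not pixel reconstruction
```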
In fact, there are several flavors of this joint embedding architecture: a simple one that we used to call Siamese networks; predictive models like this; and then predictive models where you have a latent variable that represents the fact that the prediction of SY from SX may not be deterministic, so you need to parametrize the set of possible SY using a latent variable that varies over a set or is drawn from a distribution. That's the joint embedding predictive architecture, and to train those things we have to abandon probability theory. So I've asked you to abandon generative models, and now probability theory as well. We have to use a weaker way of capturing dependencies between variables, which is energy-based models. An energy-based model is this: if you have two variables X and Y and you want to capture the dependencies between them, it suffices to produce a function that takes low values on the manifold where you have data points and higher values outside. If you have a function of this type, it captures the dependency between X and Y; you don't need anything else. If you want probabilistic predictions, an actual density, you may have some trouble, but just to capture the dependency, this is sufficient. And it's a good way of representing what goes on in one of those joint embedding systems: you want to train the system to produce a low reconstruction energy, or prediction energy, or whatever it is, on the data points you train it on, and higher energy outside.

There are two classes of methods for this. Contrastive methods, which are still popular and which I kind of invented a while back for joint embedding architectures, but which I've become very skeptical about these days because I don't think they scale very well with the dimension of the representation space: they consist in generating contrastive points and pushing their energy up, while you take the data points and push their energy down. What I prefer now is what I call regularized methods, and regularized methods basically try to minimize the volume of space that can take low energy. So whenever you push down the energy of certain regions, the energy of other regions has to go up, because there is a limited supply of low-energy space, if you want. So I'm asking you to abandon generative models, abandon probabilistic models, abandon contrastive methods, which are quite popular, and also abandon reinforcement learning, which I've been saying for 10 years. These are the pillars of machine learning at the moment, so I'm not making any friends.

Okay, so for these regularized methods there is a way to make them work and prevent the system from collapsing, essentially, and the basic idea is that you find some measure of the information content of the representations that come out of the encoders and you try to maximize it. So if you write it as an objective function for training, it measures the negative information content of SX and SY, and you minimize it. Other details I'm going to skip. One way to do this, to prevent the system from just producing constant vectors, is to put a cost function on the standard deviation of each variable coming out of the encoder.
You take each variable and you say: over a batch of samples, I want the variance to be at least one, which you can implement with a cost function of this type; it's basically a hinge loss on the standard deviation. Now the system can cheat and just decide that all the variables are equal, which is not very informative. So to get rid of that problem, you minimize the off-diagonal covariance terms, and basically you're trying to make the covariance matrix of those vectors close to the identity. Other people have had this idea in similar forms, with related methods. But that is still not sufficient, because the system can cheat by making the components of SX uncorrelated but still dependent. So there's a trick here, which we have some theory for, but not a complete one: you insert a neural net that expands SX into a higher-dimensional vector, you train this network simultaneously with everything else, and you apply the variance-covariance criterion to the output of this. That tends to make the components of SX more independent. But you have to realize that what we're doing here is pushing up on an upper bound on the information content, hoping that the actual information content will follow, and that's because we don't have lower bounds on information measures. There is some theory that shows when the variables of SX or SY become independent.

Then we can test those systems by pre-training them: for example, if you want to train them to do image recognition, you show them two distorted versions of the same image and you tell the system that whatever representations it extracts should be the same, because this is really the same image with the same content. So you pre-train the system; you don't need any labels for this, you just need a way of distorting the image. Then you feed ImageNet to it, you train a linear classifier or a very simple classifier on top, you don't fine-tune the trunk, and you measure the performance. And this works really well: you can train systems to get really good performance, accuracy in the mid-70s, with SSL, by pre-training on ImageNet and then fine-tuning on ImageNet with labels. It gets better with a bigger training set; I'm not going to bore you with the details. There are variations of this method, VICReg, which stands for variance-invariance-covariance regularization, used to train systems to do image segmentation, not just classification, so as to learn local features. And then another technique that came out of FAIR as well, called I-JEPA. This uses the JEPA architecture with predictors, and the basic idea is that you train the neural net to predict certain areas of the representation of an image from other areas: you mask a piece of the input image, you run it through, you get a representation, and from that representation you train the system to predict the representation produced from the full image. This works really well, amazingly well. It's very fast, and it beats competing methods for self-supervised learning. Again, I'm not going to bore you with the details. There's some theory, which I'm going to skip. There is a theoretical paper here that you might want to have a look at; it just came out in the last few days, co-authored with Ravid Shwartz-Ziv, who's a postdoc at NYU, on a sort of information-bottleneck approach to explaining how self-supervised learning and supervised learning work, based on information-theoretic results.
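(To make the variance-covariance criterion described above concrete, here is a sketch in the spirit of VICReg: a hinge loss keeping each variable's standard deviation above 1, plus a penalty on the off-diagonal covariance terms. The coefficients, the epsilon, and the exact normalization are illustrative choices, not the published recipe.)

```python
# Sketch of variance-covariance regularization to prevent collapse: keep the
# standard deviation of each component above 1 and decorrelate the components.
import torch
import torch.nn.functional as F

def variance_covariance_loss(s, eps=1e-4):
    # s: (batch, dim) representations from the encoder (or its expander network)
    s = s - s.mean(dim=0)
    std = torch.sqrt(s.var(dim=0) + eps)
    var_loss = F.relu(1.0 - std).mean()               # hinge on each standard deviation
    cov = (s.T @ s) / (s.shape[0] - 1)                 # covariance matrix over the batch
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = (off_diag ** 2).sum() / s.shape[1]      # push off-diagonal terms to zero
    return var_loss + cov_loss
```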
Okay, I'm coming to the end. So one reason we might want to train a JEPA is that, if we have one, we can use this architecture as a predictive world model inside an intelligent system capable of planning. Imagine we have an observation of the state of the world here. We feed the predictor an action, and it predicts a representation of the next state of the world, which we can then compare against the actually observed next state, back-propagating gradients to adjust the system. That would be single-level planning. But really, when people plan an action or a sequence of actions, we plan hierarchically. So what we want is a more abstract representation of the world that allows us to make longer-term predictions, a more abstract representation that may have fewer details about how the world works. This is called a hierarchical JEPA, which can make predictions at multiple levels. We actually have some experiments on something like this, which has some connection with wavelet transforms, but I'm not going to go into the details because I don't have time. It's a system that is trained from video and simultaneously tries to learn to predict the representations of future frames from previous frames, and also to learn representations that are appropriate for image recognition. It's a pretty complex architecture, so I'm not going to explain how it works.

But here, in the end, is the architecture you might want to use if you are able to train a hierarchical JEPA. You observe the state of the world, run it through an encoder, then another encoder, and yet another encoder, and you get a very abstract, high-level representation of the state of the world. And perhaps the task you want to accomplish is, I don't know, to go from here to New York City. So my cost function is my distance to New York City, computed from the state predicted by this predictor. The first action I have to take is to go to the airport and then catch a plane to New York, right? How do I get to the airport? I have to take an action, of course, some macro-action, which can be represented by this latent variable Z. Basically, here I represent the state of being at the airport, and the cost function for the level below is: how far from the airport am I, how far from Charles de Gaulle am I? I first need to call a taxi; the taxi is here; then I tell the taxi to go to the airport, and that takes me to the airport. How do I catch a taxi if I'm in Paris, let's say? This level is whether I'm in a taxi or not: first I need to get out on the street and hail a taxi, which actually has a low probability of succeeding in Paris. But in fact, this is not the lowest level. The lowest level is millisecond-by-millisecond muscle control. So there is a very deep hierarchy of such things.

So, as I was saying, this is an intelligent task, and you might think that humans are the only animals capable of doing this. No, your cat does it. Your cat, if it wants to jump on top of this blackboard, will go here, look around, move its head, and then jump here, then here, then here. It figures it out. This is pretty complex planning, and it requires a very accurate world model. So cats have world models; LLMs don't.
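(A very rough sketch of how the hierarchical planning in the airport example might be organized: each level plans over its own abstract actions and hands each one down as a sub-goal for the level below, until the lowest level produces primitive actions. Every component here, the per-level planners and the sub-goal decoders, is a hypothetical placeholder around the idea, not the actual system.)

```python
# Hypothetical sketch of hierarchical planning with a stack of planners:
# "go to New York" -> "go to the airport" -> "hail a taxi" -> ... -> muscle control.

def plan_hierarchically(levels, state, goal):
    # levels: list of (planner, decode_subgoal) pairs, from most to least abstract
    if not levels:
        return [goal]                                  # lowest level: goal is a primitive action
    (planner, decode_subgoal), lower_levels = levels[0], levels[1:]
    primitive_actions = []
    for abstract_action in planner(state, goal):       # e.g. ["go to airport", "board plane"]
        subgoal = decode_subgoal(state, abstract_action)   # target state for the level below
        primitive_actions += plan_hierarchically(lower_levels, state, subgoal)
        state = subgoal                                # assume the sub-goal will be reached
    return primitive_actions
```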
So that's the challenge of AI for the next few years: figuring out how to make self-supervised learning work for video, handling uncertainty in prediction, probably using joint embedding architectures, perhaps using the energy-based model framework, learning world models from observation, and then using this to plan and reason. And we can ask the question: once we figure this out, will we have machines that are as intelligent as humans and animals? The answer is: perhaps. It may not be the only required component, but it would be part of the story.

A question people are asking themselves, maybe not in serious company like here, but a lot of people are asking now: ChatGPT, GPT-4, they seem to have superhuman intelligence, they can do stuff that most people can't do, et cetera. But it's easy to get fooled; they're not that smart, and they certainly don't understand how the world works. Still, there is no question that at some point we're going to have machines that are more intelligent than humans in all the domains where humans are intelligent. There's no question about this in my mind. And it's not going to be general intelligence, as a lot of people call it, because human intelligence is actually very specialized. We like to think of ourselves as having general intelligence; we don't. We're incredibly specialized. So I prefer to talk about human-level AI rather than AGI. But before we get to human-level AI, we're probably going to have to go through cat-level AI, maybe, or dog-level AI. It's a joke: before we get to god-level AI, we need to get to dog-level AI. Is it going to happen tomorrow? It's probably going to take a while, but it's clear that progress is accelerating, because there's a lot of business interest behind this. Thank you.

I'm already a little over, so we have time for one more question.

There is a roboticist, Hod Lipson, who is teaching machines that have been randomly wired to walk, or something like this. How is it related?

Yeah, Hod Lipson is at Columbia. He's using reinforcement learning. So this is one of the things I say we should not use, or at least we should minimize its use. I think the purpose of reinforcement learning research should be to minimize the use of reinforcement learning. The reason being that reinforcement learning is so inefficient in terms of data. I mean, we all hear about AlphaGo and the successes of reinforcement learning for game playing and things like that, including for poker and even Diplomacy. But those systems require enormous amounts of trials. The number of games played by AlphaGo to train itself to reach superhuman performance, or human-level performance, is on the order of millions of games, and it's insane.

So your proposal is that it will do it faster?

Yeah. That said, Go is a very difficult task for humans; that's why it's an interesting game, because it's hard for humans. And it turns out humans suck at it. Machines are much, much better at this type of tree-structured planning and combinatorial search than humans, who have very limited short-term memory and kind of slow brains, right? The best Go players in the world before AlphaGo thought they were maybe two or three handicap stones below God, the ideal Go player. And it turns out, no, humans are just so bad; it's more like nine stones behind, like a beginner compared to an expert player. We're really, really bad at this, which is why it's not that hard, in the end, for computers to be better than us. We're just bad at it.

There was a question.
Yeah, I mean, you draw a path: cat, dog, before arriving at human. On the other hand, we thought that the main difference between humans and animals was language, and suddenly you have ChatGPT and these systems that reproduce it, and much more than language: the ability to reproduce proofs, even without understanding them, and so on. So there may be other paths to intelligence, and what is surprising is that what you define as the difficulty is what animals have, what we thought was not difficult. So there is something a bit strange here; it's not just business as usual. That's my first question. And the second one is more technical: you've been pushing these energy-based models to abandon all the constraints of probabilities, but on the other hand, in the new proposal you're making, you are pushing normalizations and so on, so it has a bit of a flavor of going back to probabilities. That's what I want to understand.

No. With the joint embedding architectures, the Y variable, the one you're supposed to predict: if you had a probabilistic approach, you would basically have to identify P of Y given X in a prediction framework. But the Y variable now goes through an encoder. So to compute P of Y given X, you would have to invert this encoder, and the problem is that this encoder is not invertible, because there are many Ys that produce the same representation. That's kind of the whole point of this approach: the encoder that looks at Y eliminates all kinds of irrelevant information, so that the set of inputs Y that produce a given representation, the invariant set if you want, is an entire manifold.

But when you have a probability distribution, you have a Gibbs energy which forgets about the irrelevant details, and you can still produce a sample, textures and so on.

Okay, so you could take the prediction energy, take e to the minus that energy, and normalize. But you can't normalize: the integral does not converge, because the set of Ys at a given level of energy has non-zero volume. So you can't normalize, and you can't invert that function. There's no way you can turn this into a probabilistic model, so you have to abandon the whole idea.

And what about the first question?

Okay, so the first question is interesting and relevant. We are biased, as humans, to think that most of the knowledge we have is language-based, and as I said, that's not true. Most of human knowledge is actually non-linguistic. Think about the quantity of data that a large language model like LLaMA is trained on: 1,400 billion tokens. If you had a human read for eight hours a day at normal speed, it would take 22,000 years to read all of it. So obviously those things work, but to work they need to be trained on enormous amounts of data, which humans don't seem to require. So we're obviously able to extract a lot more about the underlying structure of the world and of reality with considerably less data. The total number of video frames, or the equivalent, that a five-year-old has seen during their life is less than a billion; you can get that in a few hours of YouTube. It's really not that much data in the end. So how do we learn this quickly? Now, you could say that the genome encodes a lot of our linguistic abilities and that's what makes us intelligent. But then you realize that chimps don't have language.
Their genome is 99% identical to ours. And when you quantify it, the genomic difference between humans and chimpanzees can be stored in eight megabytes. To store a large language model like the 65-billion-parameter one, even with 16-bit weights, you need 130 gigabytes, right? So language is an epiphenomenon. It only appeared in the last couple of hundred thousand years. It's been really useful to the human species, but understanding language is basically handled by Wernicke's area, which is a little piece of brain about this big, right here, and production by Broca's area, which is right here, also about this big. That's what LLMs do. What they're missing is the prefrontal cortex. That's what makes us smart, and animals too.