Hello and welcome everyone to the Active Inference Institute. This is ActInf GuestStream number 41.1 on April 25th, 2023. We're here with Elliot Murphy and Steven Piantadosi. This is going to be quite a discussion. We will begin with opening statements from Steven and Elliot. Elliot will then lead with some questions and we'll have an open discussion at the end. So, Steven, thank you for joining, and over to your opening statement. Cool, hi, so I'm Steve Piantadosi. I'm a professor in psychology and neuroscience at UC Berkeley. And I guess part of the reason that we're here is that I recently wrote a paper on large language models, in part trying to convey some enthusiasm about what they've kind of accomplished in terms of learning syntax and semantics, and in part pointing out that I think these models really change how we should think about language, how we should think about theories of linguistic representation and theories of grammar, and likely also theories of learning. Yeah. Awesome, yeah, so I'm Elliot Murphy. I'm a postdoc in the Department of Neurosurgery at UTHealth in Texas. I read Steven's paper with great interest, as did a lot of people. There were some areas of convergence, but the things I want to kind of focus on today in responding to Steven have to do with areas of divergence, maybe. So, Steven's paper is based on the idea that modern machine learning has subverted and bypassed the entire theoretical framework of Chomsky's approach. So I wanted to kind of respond to some of these main arguments and some other related arguments in the literature that some folks listening might have some insight and thoughts on. So it's a very common criticism to say that large language models just predict the next token, which is obviously a bit of a cliché, right? It's not quite true. They don't just predict the next token. They also seem to confabulate. They seem to hallucinate. They maybe lie.
They randomly provide different answers to the same question. They seem to stochastically mimic language-like structures. They sometimes correct themselves, sometimes when they shouldn't. If you push them a little, they kind of change their mind sometimes. In fact, if Fox News is currently looking for a replacement for Tucker Carlson, they could definitely do worse than using ChatGPT if they're looking for a similar caliber. So these models seem to do all sorts of wild things. And over the past 10 years, there's been a sequence of different systems developed, like word2vec, GloVe, BERT, and each of them is based on a different neural net approach. But ultimately they all seem to take words and characterize them by lists of hundreds or thousands of numbers. So the GPT-3 network has 175 billion weights and 96 attention heads in its architecture. And as far as I know, maybe Steven can correct me here, we don't really have a great idea of what these different parts really mean. It just seems to kind of work that way. Like attention heads in GPT-3 can pay attention to much earlier tokens in the string in order to help them predict the next token. But the whole architecture, from start to finish, is driven by kind of engineering-based motivations. And I always kind of wonder, what about all the models that kind of failed, from these LLMs, from the different tech companies? It's like these companies often seem to, you know, make it seem like they have these models that really work very well straight out of the box. And they all seem to be named after some kind of famous artist, right? They have DALL·E, after Salvador Dalí. They have Da Vinci. Maybe pretty soon one of these companies will release a large language model called Jesus or something, I don't know. But they always say, here's our new foundation model. It's called Picasso. It's the first one we tried. It works just great. No problems straight out of the box.
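Since the discussion leans on how an attention head can "look back" at much earlier tokens when predicting the next one, here is a minimal numpy sketch of a single causal self-attention head. This is an illustrative toy, not GPT-3's actual implementation: the dimensions, random weights, and variable names are all assumptions made for the example.

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """One attention head over a token sequence x of shape [T, d].

    Each position forms a query, compares it against the keys of all
    *earlier* positions (a causal mask hides future tokens), and returns
    a weighted mix of their values -- this is the mechanism that lets a
    head attend to much earlier tokens when predicting the next one.
    """
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / np.sqrt(k.shape[-1])       # [T, T] similarities
    mask = np.triu(np.ones((T, T), dtype=bool), 1)  # True above diagonal
    scores[mask] = -np.inf                          # block future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v

# Toy usage: 5 tokens, 4-dimensional embeddings, random projections.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
out = causal_self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (5, 4)
```

Because of the causal mask, changing a later token cannot affect the output at earlier positions, which is exactly the property that makes next-token prediction trainable in parallel.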
But I always wonder, what about all the black boxes that have kind of failed every time? There doesn't seem to be a very open and clear structure to the scientific reasoning behind selecting one model or another. But again, I'm open to being corrected about that. So even basic language models do pretty well on basic word prediction. So the issue is whether these tools provide any insights into traditional psycholinguistic notions like grammar and parsing. So this is really why I kind of prefer the term corpus model, rather than language model, suggested by people like Csaba Veres. So it's been pointed out that no one really thinks LLMs tell us anything profound about Python when they learn Python code just as well as natural language. Well, Python is a symbolic language with a phrase structure grammar. And nobody says LLMs are unveiling the secrets of Python, right? So just to quote Veres here, he says: if ANN models can be construed as explanatory theories for natural language based on their successes on language tasks, then in the absence of counterarguments they should be good explanatory theories for computer languages as well. Therefore, successful ANN models of natural language cannot be used as evidence against generative phrase structure grammars in language. So corpus model is really a more appropriate term for other reasons too. People like Emily Bender and some others have shown that features of the training corpus, in fact, I think you cite this in your paper as a limitation, they show that features of the training corpus can heavily influence the learning process. So it's been shown that the performance of large language models on language tasks is really heavily influenced by the diversity of the training corpus. But natural language itself is not biased, right? It's just the computational system. Human beings can be biased in what they say and how they act. But natural language itself isn't biased, right?
So large language models are subject to all sorts of biases, and it therefore seems difficult for me to, you know, agree that they can really be models of language; they're models of something else. So just to kind of wrap up this argument, you know, even though LLMs are clearly exposed to vastly more linguistic experience than children, again, this is something else that Steven concedes and talks about in his paper. Even so, their learning outcomes may still be relevant in addressing what grammatical generalizations are learnable in principle. So I do agree with this statement here, you know, that in principle they can tell us something about learnability, rather than things like, you know, broad acquisitionist frameworks. But that's about as much as I think you can maybe say right now: showing that some inductive biases are not necessary for learning is not really the same thing as showing that they aren't present in children. So there's been a long debate about whether, you know, negative evidence and instruction and correction and feedback during language learning are necessary or even useful for infants and children. But right now I kind of agree more with Eugene Choi and Gary Marcus and others who've highlighted how LLMs are currently, you know, very expensive to train. They're clearly an example of concentrated private power in the hands of a few tech companies. Their environmental impact is massive. And, you know, many people have been less constrained and conservative in their assessment here, much less so than Gary Marcus and Eugene. So Bill Gates recently wrote that ChatGPT is the biggest tech development since the graphical user interface, the GUI. And Henry Kissinger wrote in February in the Wall Street Journal that as ChatGPT's capacities become broader, they will redefine human knowledge, accelerate changes in the fabric of our reality, and reorganize politics and society.
Generative AI is poised to generate new forms of human consciousness. So very radical claims are happening at the moment. And I do wonder if sometimes all of the AI hype may have, you know, seeped into certain portions of academia, potentially. A lot of grand claims being made. But I think, you know, more concretely, just to put it back to Steven here, I wanted to maybe raise the issue of a critique by Rawski and Baumont that I think you've read on LingBuzz. I think I saw on Twitter that you don't like the response they gave, because the objection that they made is that, you know, science is an example of deductive logic, and your objection is that science isn't deductive, it's inductive, right? But I think their general point might be more accurate, namely that you can't use the fact that language models do well predicting some linguistic behavior in humans and some neuroimaging responses, you can't use that alone to claim that they can yield a theory of human language. So in your paper, Steven, you note that it seems that certain structures work better than others. The right attentional mechanism is important. Prediction is important. Semantic representations are important. And that's what we can glean currently based on these models, right? But so far that's really all I've been able to glean in the literature. I'm not sure if you have more insights here. So Rawski and Baumont used the example of poor prediction but strong explanation, right? Explanatory power, and not predictive accuracy, forms the basis of modern science. I do want to explore this a little bit later, maybe, but modern language models can accurately model parts of human language, but they can also perform very well on impossible languages and unnatural structures that humans can't learn and have great difficulty processing. And I know you're familiar with these criticisms, right? But you're definitely not alone here at the same time.
So Ilya Sutskever, the chief scientist at OpenAI, said in an interview recently: what does it mean to predict the next token well enough? It means that you understand the underlying reality that led to the creation of that token. Which is quite divergent from a lot of more conservative claims in the literature here. And also, I would just say in response to that, that different components of science can be either inductive or deductive, right? It's not really an either-or. You have an existing theory, you formulate a hypothesis, you collect data, you analyze it, and that's kind of a deductive process. But there are also cases where you start with a specific observation, you find some patterns, and you induce general conclusions, right? And then there's abduction, where you magically invent hypotheses and reduce the hypothesis space. You wouldn't really say that deductive reasoning is unscientific, or inductive reasoning is unscientific, or abductive reasoning is unscientific, right? These are all just different ways of doing stuff. I mean, in your paper, you give the examples of using models to predict hurricanes and pandemics as examples of stuff that is as rigorous as science gets. And then you implore your reader to conclude that the situation is no different for language models. But I guess for me, the issue is that models predicting hurricanes are not in the business of answering the question, what is a hurricane, right? Models predicting the weather are very accurate, but they're, you know, aligned with the meteorology department, not a substitute for it. So I guess I'll just, you know, hand it over to you. Yeah, okay, well, there's a lot there. I guess I could start just by saying that I agree with many of these criticisms, right? About these models being controlled by, you know, one or two companies, that being very, very problematic.
You know, they have all kinds of biases that they've acquired because they're trained on text from the internet. That's hugely problematic. You know, I certainly agree that there are things, at least at present, that the models don't do well, right? So I think it's easy to find examples of, you know, questions and problems that will trip them up. I think why I've been excited about them, though, is not necessarily in those terms, right? But in terms of performance on language, specifically syntax and semantics, I think they're far beyond kind of any other theory in any other domain, right? So there's no other theory out of linguistics or computer science which can generate, you know, long, coherent, grammatical passages of text. And so, kind of admitting all of their problems as, you know, tools or things which are deployed by companies, there's still this question of, like, how are they at dealing with language? And I think this is where a lot of the enthusiasm comes from: there really hasn't been anything even remotely like them in terms of linguistic ability. And that's the thing that I think is exciting. So yes, I agree with a bunch of these things you started with, but nonetheless, I think in terms of syntax and semantics, there's just no other theory which is comparable to them. But so let me push back then, right? The main objection from a lot of people I've spoken to in departments of linguistics, which is a lot of the general response to your paper, is to really say: you're right, they do a wonderful job accurately modeling a lot of aspects of syntax and semantics. However, and, you know, Chomsky talks about facts about language, which is an old-fashioned notion, but I really think that's kind of an important notion too, right? Is there some discovery about language itself that LLMs can uniquely provide?
So like if LLMs made some prediction about, let's say, a sentence structure type X being more difficult to process than sentence type Y. And this is a unique prediction that only they generated. And no human linguist, Chomsky, Hornstein, Adger, none of these people had ever predicted that before, but it turns out to be true. You do eye-tracking experiments, you do all sorts of different behavioral experiments, and it turns out, oh, you know, after all, it turns out to be true. This is a new insight about language processing, a new insight about language behavior. I just wonder. I'm not saying that this is not possible in principle, because it might happen in the near future. But that's, I guess, for me the crux of why a lot of linguists are speaking up, not that I'm speaking on behalf of the entire linguistics community here. And, you know, I guess that would be one of the main objections. Yeah, I mean, I don't know of one. I guess I think of the insights they've provided as kind of general principles, right? So I think about things like the power of memorizing chunks of language, right? So they seem to be very good at constructions, for example. And there's lots of linguistic theories, Chomsky's in particular, right, which are about trying to find kind of minimal amounts of structure to memorize, right? Trying to derive as much as possible from some small collection of operations. And I think that hasn't gone well for those theories, right? Whereas this goes really well, right? So if we think about something which has those memorization abilities, if we think about theories of grammar, for example, which build on humans' really remarkable ability to memorize different constructions, right, or different words, you know, tens of thousands of words, tens of thousands of different constructions, tens of thousands of different idioms, maybe our theory of grammar should be integrated with that.
And these models are in some sense a kind of proof of principle that that kind of approach can work well, right? You can think about making other types of predictions with them, some of which people are currently doing, for example trying to use them to measure processing difficulty, to measure surprisal, for example, from these models. Their surprisal measures, right, are much better than, say, context-free grammars or other kinds of language models. And then it's an interesting question how those surprisals or predictabilities relate to human processing, right? And it may capture some of it, or it might be non-linear, or it might only capture a little bit of it, or whatever. That's an interesting kind of other scientific question. But I think in principle, right, they can make predictions about, for example, the connections between sentences, right? So in the paper, I gave this example of, you know, converting a declarative into a question in 10 different ways, right? And presumably when GPT or something is doing that, it's finding 10 different questions which are all in some way related, kind of nearby in the model's underlying semantic or syntactic space. And so those kinds of things are of the type that I think some linguists might want, right? Which is, here's some hidden connection between sentences or their structures. But as far as I know, they haven't been evaluated empirically yet. So, yeah, yeah. I mean, these kinds of models are only a few years old. So I think it's reasonable to be excited about them even though this kind of work hasn't been done yet. No, that's right. No, totally. I mean, I think that's the right perspective to take. But I think this gets to the issue of, well, you mentioned surprisal, or you mentioned learnability. You know, LLMs learn some syntax, but they do so with obviously way, way more data than infants get, such that the observation of potential structure in and of itself is not a refutation of the poverty of the stimulus.
Well, the weaker version, I should say, of the poverty of the stimulus argument. So the main fact that LLMs can do what they do without grammatical priors is very striking, I agree. And in fact, you wouldn't have predicted that maybe five or six or seven years ago. But it doesn't yet invalidate the claim that humans have such priors and we bring those priors with us. So in order to see if computational linguistics can constrain hypotheses in theoretical linguistics, which I think it can do, by the way, this needs to be done with, you know, care for experiments in which different learning parameters are controlled, and gigantic language models like GPT-3 are basically useless here. So this gets to some of Tal Linzen's complaints that we need something like a BabyLM project, which I know you're interested in, where we have more ecologically valid training sets. You make the prediction in your paper that some structure will be learned from that. I suspect you might be right there. But even so, even with the BabyLM Challenge, there's still the kind of non-trivial issue of addressing more traditional questions, like when kids start to generalize based on the amount of current input, based on different factors cross-linguistically, and that requires just traditional, you know, psycholinguistics and language acquisition research. So LLMs, you know, do care about things like frequency and surprisal, as you said, but there's a really nice paper by Sophie Slaats and Andrea Martin, a really beautiful paper that I think you may have seen, that shows very nicely that distributional statistics can sometimes be a cue to moments of structure building. But that doesn't replace these notions pertaining to composition.
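The surprisal measures that keep coming up here are just negative log probabilities of each word given its context. As a hedged illustration, the sketch below computes per-word surprisal from a toy add-alpha smoothed bigram model; in actual psycholinguistic work the probabilities would come from an LLM, and the corpus and smoothing constant here are invented for the example.

```python
import math
from collections import Counter

def bigram_surprisals(corpus, sentence, alpha=0.1):
    """Per-word surprisal, -log2 P(w_i | w_{i-1}), under an add-alpha
    smoothed bigram model. Psycholinguists compute the same quantity
    from an LLM's next-token probabilities; higher surprisal is the
    standard linking hypothesis for greater processing difficulty."""
    vocab = set(w for s in corpus for w in s) | set(sentence)
    bigrams, unigrams = Counter(), Counter()
    for s in corpus:
        for prev, nxt in zip(["<s>"] + s, s):
            bigrams[(prev, nxt)] += 1
            unigrams[prev] += 1
    V = len(vocab)
    out = []
    for prev, nxt in zip(["<s>"] + sentence, sentence):
        p = (bigrams[(prev, nxt)] + alpha) / (unigrams[prev] + alpha * V)
        out.append((nxt, -math.log2(p)))
    return out

corpus = [["the", "dog", "barks"], ["the", "cat", "sleeps"],
          ["the", "dog", "sleeps"]]
for word, s in bigram_surprisals(corpus, ["the", "dog", "sleeps"]):
    print(f"{word:8s} {s:.2f} bits")
```

An unseen transition (say, "cat barks" given this toy corpus) gets a higher surprisal than an attested one, which is the basic signal that surprisal-based reading-time models exploit.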
So I'll just read a quote from Chomsky 1957, which sounds a lot like what Slaats and Martin said: despite the undeniable interest and importance of semantic and statistical studies of language, they appear to have no direct relevance to the problem of determining or characterizing the set of grammatical utterances. I think that we are forced to conclude that grammar is autonomous and independent of meaning, and that probabilistic models give no particular insight into some of the basic problems of syntactic structure. So that second hedge of the second sentence turned out to be incorrect. So it's true that, you know, what Chomsky said of the available statistical models in 1957 is no longer accurate when applied to models today, which can make abstract generalizations about novel strings and distributional categories, as you mentioned, right? But the performance of a single model does not provide direct evidence for or against the learnability of a particular structure, given the vast distance between any computational model available today and the human brain. Model success does not mean that the structure is necessarily learned, and model failure also doesn't mean that the structure is not learnable, right? Yeah, yeah. So I mean, I think it's maybe worth unpacking kind of a couple of different versions of learnability arguments that people have made, because there have been very, very strong kind of impossibility claims coming out of Chomsky's tradition, right? Those were never claims about the amount of data that was required, right? They were claims about the logical problem of language learning, and that it was just impossible, right? It was impossible without having kind of substantial constraints on the class of languages or the class of grammars that you would acquire. And people for a long time have been arguing against that version of things.
You know, there's old work by Gold, and then there are whole kind of grammatical theories of acquisition built on that tradition that worry a lot about the kind of order in which you traverse through different hypotheses and consider different options and things. And my favorite reference on this is this paper by Nick Chater and Paul Vitányi called something like "ideal learning" of natural language, that basically shows that an unconstrained learner could, with enough data, acquire the kind of generating rules, the generating grammar, just from observing strings, right? But that paper was really in response to this huge body of work that was arguing that learning from positive examples, so from just observing strings, was like logically impossible, right? So of course, people in Chomsky's tradition really liked that form of argument, because it was one that said you had to have something innately specified in order for language acquisition to work. It was like kind of a mathematical argument, right? That you had to have some kind of innate grammar and innate ordering of hypotheses or something. And all of that just turned out to be totally wrong. So if you move to slightly more kind of realistic learning settings, which Chater and Vitányi do, then it turns out an idealized learner can acquire stuff. And there are no statements about the amount of data that's required even there, right? That's the kind of pure logical ability to learn. And that ability is what I think the big versions of large language models also speak to, right? So Chater and Vitányi and other work kind of in that spirit is mathematical and kind of arguing in principle, but never created something which was really a grammar, right? Or a real kind of implemented language model. So even, you know, a model which is trained on 100 million or 100 billion or however many tokens, right? Even that kind of model I think is relevant to that version of the debate, right?
And showing that language learning is not impossible from a very unconstrained space. Okay. And then there's a second version, right? Which is: can we learn language with the specific data that kids get, right? And that's both the amount of data and the form of the data. And so for people who don't know, the BabyLM Challenge is this, I guess you could call it a competition, or, well, I guess it is a challenge, trying to get people to train language models on human-sized amounts of data. So that's something more like, I think there are two different versions, 10 or 100 million words in the training set, which is like a hundredth or a thousandth or something as big as what these big AI companies are using for their language models. And I think actually that's exactly the right kind of thing and exactly what the field needs, right? Because you might find that on a child-sized amount of data, you can essentially learn syntax, right? Which I think would be the strongest argument against these poverty of the stimulus claims. You could alternatively find that maybe you can't learn very much. Maybe you come up with a much crummier kind of language model where it's lacking some syntactic or semantic abilities. I actually think that the failures there are a little bit hard to interpret, because kids' data, when they're actually learning language, they get a lot more data than just strings of sentences, right? They're interacting in an environment. So there's stuff in the world in front of them. Their utterances are also interactive, right? So you can say something and see whether your parent brings you the thing that you asked for, for example, right? That's long been argued by people as an important cue in language acquisition. So in the BabyLM Challenge, there is an ability to train these models with kind of multimodal input. So I think you can give them as much video data as you want to give.
But probably it's hard to kind of replicate exactly the type of setup and feedback that kids actually get. So I don't know, you know, I'm excited to see where that goes and how things pan out there. You know, I think that there is an interesting related question for large language models, which is understanding exactly what all of the data is doing. So it could be that you need so much data for these models because they're effectively inventing some form of semantics internally, right? So they're both discovering the rules of syntax and they appear to be learning quite a bit about word meanings. And it's totally unclear, I think, how much of the data in these modern models is needed for syntax versus semantics. My own guess would be that the syntactic side probably requires much less data than the semantic side. Actually, a former student of mine, Frank Mollica, and I wrote a paper a few years ago trying to estimate the amount of information a learner would necessarily have to acquire for learning the different aspects of language. So you have to learn all the words, and you learn their forms, you learn their meanings, you probably know their frequencies, you have to learn syntax. And basically what we found in that analysis, basically just a kind of back-of-the-envelope calculation for each of these domains, is that syntax is actually very few bits of information. It doesn't take that much information to learn syntax, whereas most of the information you acquire is actually for semantics. So specifying 30 to 50,000 different word meanings, even if each meaning is just a few bits, right? Like that requires a lot of information, and probably each meaning is more than a few bits, right? So that would make me guess that what's happening with large language models is most of their training data is about word semantics. And you can think about other ways that kids get word semantics, right?
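The back-of-the-envelope point here, that word meanings dwarf syntax in information content, can be made concrete with some toy arithmetic. The specific numbers below are assumptions chosen purely for illustration, not the estimates from the Mollica and Piantadosi paper.

```python
# Illustrative back-of-the-envelope, in the spirit of the lexicon
# estimate described above. All four numbers are assumptions for the
# sketch, not figures taken from the published analysis.

n_words = 40_000          # assumed adult vocabulary size
bits_per_meaning = 500    # assumed bits to pin down one word meaning
bits_per_form = 30        # assumed bits per phonological word form
bits_syntax = 700         # assumed bits for the grammar itself

semantics = n_words * bits_per_meaning
forms = n_words * bits_per_form
total = semantics + forms + bits_syntax

print(f"semantics: {semantics / 8 / 1e6:.2f} MB")
print(f"word forms: {forms / 8 / 1e6:.2f} MB")
print(f"syntax: {bits_syntax / 8:.0f} bytes")
print(f"semantics share of total: {semantics / total:.1%}")
```

Even with these made-up figures, the qualitative conclusion is robust: multiply any per-word meaning cost by tens of thousands of words and semantics dominates, while a grammar measured in hundreds of bits is a rounding error.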
That's not just kind of co-occurrence patterns in text. But I agree all of that is up in the air, and it's really exciting to see what will happen. Yeah, I know that some of the earlier results from Linzen's lab suggest that, at least restricted to ecologically valid training set sizes, models seem to generalize linear rules for English question formation rather than the correct hierarchical rule. So I think there's a real sense in which the space of the correct syntactic priors and inductive biases is yet to be really settled, but it seems, at least to me, pretty obvious that there have to be some. So there's also some evidence, going back to this frequency issue, that children acquiring English sometimes spell out an intermediate copy of movement in the specifier position of the lower complementizer in a long-distance wh-question. So there's a thesis by Thornton and some papers about this. So they say, "which person do you think who did that" rather than "which person do you think did that". So this is an interesting, you know, missetting, because some languages do actually spell out these intermediate copies, but English doesn't. So the kid makes the error in setting their grammar, but the frequency of this in the input is actually zero. So our mutual friend Gary Marcus also has an argument against frequency determining a kid's output, in the case of German noun plurals: a more regular form of a certain kind is preferred, not the frequent one. And there are lots of examples like this. So it's sometimes claimed that subject-experiencer passives, where the subject is passively experiencing something, are very delayed in kids in comprehension studies until around age eight, because they're not very frequent in the input.
But Ken Wexler and colleagues have gone through subject-experiencer wh-questions like "who likes Mary", and they discovered that these are as infrequent in the input as subject-experiencer passives, but kids have no problem in comprehension studies with these questions, while they do have problems comprehending subject-experiencer verbal passives. So frequency once again seems to be irrelevant, or at least it's not explanatory, right? I guess it's not explanatory with respect to theory building. So how can LLMs help with these, you know, diverging cases, when there's clearly something else going on besides frequency? So LLMs, you know, they seem to generalize, again going back to the cases that you have in your paper. You show that they generalize the structure of colorless green ideas, which is obviously very cool. But the poverty of the stimulus has never really been about not being able to learn language statistically. I know you made that claim, right? Now, Chomsky's point in the 50s about the statistical models of the day no longer applies to commercial LLMs in 2023, and that's correct, but we can't use that single point to undermine, you know, the entire generative enterprise. Chomsky's basic point was that you could have a grammatical structure wherein every bigram has zero frequency and it also fails to provide clearly interpretable instructions to the conceptual interfaces, so the interfaces of other systems of the mind. So as you show in your paper, GPT mimics examples like colorless green ideas. But, you know, again, this sentence yields over 150,000 results on Google, and it's discussed extensively in the literature. The fact that it's able to mimic this doesn't really tell us that much. At least we can't really say anything with much confidence. So, you know, Abeba Birhane of University College Dublin has this quote recently: do not mistake your own vulnerability for an LLM's intelligence.
And in fact, even Yann LeCun wrote last year that critics are right to accuse LLMs of being engaged in a kind of mimicry. And the example sentences from ChatGPT that you give in the paper actually don't do a good job, because, as you say, it's likely that meaningless language is rare in the training data, but they can either do it or they can't. Like, there's no middle ground in terms of giving those 10 examples like this. So you have colorless green ideas, which are very different semantic objects from things like brown shimmering rabbits, white glittery bears, black shiny kangaroos, green glittering monkeys, yellow dazzling lions, red shimmering elephants, right? These are all, like, semantically weird and a bit strange, but they're still legal structures. They're kind of meaningful syntactic-semantic objects, right? I'll just leave it there, yeah. Yeah. I mean, so maybe I can respond to the first point first, right? So you started off talking about these other kinds of acquisition patterns which maybe don't map directly onto frequency. And I think it's actually a mistake to think that kind of modern learning models should be just based on frequency, because they're clearly learning pretty complicated families of rules or constructions or something. And I think it's very likely that when they're learning, they're in some sense searching for a simple or parsimonious explanation of the data that they've seen, right? And how that cashes out in a neural network is maybe complicated, and it depends on parameters and the specifics of the learning algorithm and those kinds of things. But I'd suspect that it's likely to be the case that they're learning over a complicated set of things, right? A complicated kind of family of rules and constructions. And that means, I think, that their generalizations, maybe like the examples from people that you gave, might be kind of discontinuous in the input, right?
So sometimes you could imagine seeing some strings which lead you to a grammar, and the simplest grammar of the data that you've seen so far is one which predicts an unseen string, right? And if that happens, then you'll be taking the data and learning a representation which generalizes in some novel, unseen way, purely because that generalization is sort of the simplest account of the data that you've seen to date, right? I think that's sort of what linguists try to do, right? Try to look at the data and come up with a theory of it, and then sometimes that theory predicts some new phenomenon, right? Or some new type of sentence. And so if they're learning over a sufficiently rich space of theories, then it wouldn't be unreasonable or unexpected for them to also show those kinds of patterns. Now, whether they do or not, I think is still an open empirical question, right? Because we have to train them on small amounts of data and test their generalizations and these kinds of things. But I don't think just the fact that humans do things which are not purely based on frequency is any evidence at all either way, right? Because once you're learning over rich and interesting classes of theories, then that is the expected behavior. Actually, I had a paper about a year ago that I think you're familiar with, Yang and Piantadosi, where we were looking at what happens when you give a program-learning model strings from different formal languages. So think of giving a general model just ten or twenty maybe simple strings that obey some pattern, and then asking it to find a program which can explain that data, which often means finding some way of programmatically writing down the pattern in the strings. And there's a figure we have in that paper which is really relevant to this point, where the generalizations that that kind of model makes are, I think, qualitatively like the ones you're describing for people, right?
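[Editor's note: the simplicity-based generalization Steve describes here can be sketched in miniature. This is an illustrative toy, not the Yang and Piantadosi model: the "grammars", their description-length costs, and the strings are all invented for the example. The idea is just that the shortest hypothesis covering the data can assign high probability to strings with zero frequency in the input.]

```python
import math

data = ["ab", "aabb", "aaabbb"]  # observed strings

def rote_list(s):
    # "grammar" that just memorizes the observed strings
    return s in data

def an_bn(s):
    # the rule a^n b^n for n >= 1
    n = len(s) // 2
    return len(s) > 0 and len(s) % 2 == 0 and s == "a" * n + "b" * n

hypotheses = {
    # (membership test, rough description length in bits: the rote list
    # pays for every stored string; the rule pays a small fixed cost)
    "rote list": (rote_list, 8 * sum(len(s) for s in data)),
    "a^n b^n": (an_bn, 16),
}

def score(name):
    test, description_length = hypotheses[name]
    if not all(test(s) for s in data):
        return math.inf  # hypothesis fails to cover the data
    return description_length

best = min(hypotheses, key=score)
print(best)  # the compact rule wins over rote memorization
# The winning grammar predicts a string unseen in the training data:
print(an_bn("aaaabbbb"))  # True
```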
Where you can give them a small amount of data and they will predict unseen strings with very high probability, even though those strings have zero frequency in the training input, right? And the reason it does that is that often the most concise computational description of the data that you've seen is one that predicts some particular new unseen output. So that model is essentially an implementation of the kind of Chater and Vitányi program-learning idea that I brought up earlier. But it's one that I think, if you think about it in the context of these arguments about kids saying unusual or unexpected things, like that is predicted by all of these kinds of accounts, right? Because as long as these things are effectively comparing an interesting space of grammars, then they'll show that kind of behavior, I think. So, okay. So I guess the argument would be that at least from the generative perspective, syntax is functioning separately, but it still maps to semantics. It informs pragmatics, right? So in the minimalist program, syntax is obviously minimal, it's very small. It's just linearization and labeling. Those are the only two operations: a linearization algorithm at the sensorimotor systems and some kind of categorization algorithm at the conceptual systems. So Chomsky's architecture is kind of reliant on the process of mapping syntax to semantics, right? It's a form-meaning relation. It's not just structure and it's not just meaning. So LLMs don't really have this mapping process, right? Like, where's the mapping to semantics? And if there is a mapping, what does the mapping process look like? What are the properties of its semantics? Do the properties of the semantics place their own sets of constraints on the mapping process, like they do for natural language? Do these kinds of constraints inform each other? Is there a kind of back-and-forth process, right? Like, LLMs don't really seem to describe this form-meaning pairing, right?
Which meanings, which strings, for example, right? Well, sorry, are you saying that they don't have semantics at all? Or are you saying that there's just not a clear delineation between how the structures get mapped onto the semantics? Yeah, the latter, right? So they clearly have potentially some kind of semantics. I know you've argued for conceptual role theory being relevant here, right? The rest of it is maybe a little bit more mysterious. But in linguistics there's a theory of the mapping process itself. It's explicit, and you can see it in action, and you can test different theories of it in psycholinguistic models and what have you. The actual regulation, the kind of constrained ambiguity, ambiguity in the sense of, you know, one word, multiple meanings, or one structure, multiple interpretations, et cetera, right? Yeah, I mean, if you think they have semantics, then I think they have to have a mapping from the syntax to the semantics. I agree that nobody really understands how they're working on any deep level, right? So I agree it's not as clear as, say, in generative syntax and semantics, right? Where, you know, you kind of write down the rules of composition and can derive the compositional meaning of a sentence from its component parts or something, right? Like, that's not how they're working, right? But I just wouldn't take for granted that it has to be like that. Like, it could be that how they're working is actually how we work, right? That everything is represented in some high-dimensional vector space, and there's some complicated way in which that vector semantics gets updated with each additional word or whatever in a linguistic stream. But I think it's clear that they have some kind of representation of the semantics of a sentence, right? Like, they can answer questions, for example, at least approximately. I mean, it's not perfect, but it's not like an n-gram model or something, right?
Which really doesn't have semantics. So I think that they're definitely representing semantics and updating that as they process language; it just happens not to look like these other formal theories. And I guess I don't see why that's a problem, right? Like, those other formal theories could just be poor approximations or just totally wrong, right? Yeah, yeah, totally, totally. I mean, there are also ways in which some of the formal theories in semantics are already potentially compatible with what some of these things are doing, right? So another way to think about this is, LLMs are, well, LLMs are compression algorithms, but natural language understanding is kind of more about decompression. It's disambiguating meaning X out of meanings X, Y, Z. It's all about making inferences about meta-relations between concepts that are not in the training data. So some examples that Melanie Mitchell gives are things like "on top of": she's on top of her game, it's on top of the box. All of these vary with context. So there's a lot of other things that are going on, right? And I think you discuss some of these examples in your paper. But the fact is that language, at least again under this theory of language, is not about string generation. It's about this form-meaning pairing machine. So some semanticists in the generative tradition even think that the rest of semantics is just "and", right? So Paul Pietroski's conjunctivist theory of semantics is that human semantics is just conjunction, that's it. Which again, is very simple, elegant, it's interpretable, it's compatible with a lot of the things that are maybe going on in your neck of the woods, right? But regardless, natural language is still more compositional than things like formal languages, just to make a clear distinction that's been made. It has a much richer compositional structure. There's more stuff going on, maybe.
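[Editor's note: the n-gram model Steve contrasts with LLMs just above can be made concrete. This is a minimal bigram sketch with an invented toy corpus: the model stores nothing but co-occurrence counts, so whatever "prediction" it does carries no representation of meaning at all.]

```python
from collections import Counter, defaultdict

# Toy corpus; a real n-gram model is trained the same way, just at scale.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each word follows each other word.
counts = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    counts[w1][w2] += 1

def predict(word):
    """Most frequent continuation of `word` in the training data."""
    return counts[word].most_common(1)[0][0]

print(predict("sat"))  # "on"
print(predict("on"))   # "the"
```

There is no sense in which this model could answer a question about cats or rugs; it can only continue strings from counts, which is the contrast with LLMs being drawn here.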
So it's been pointed out before that things like attention-based mechanisms in transformers allow for combinations of discrete token bindings, which is more approximate to a Merge-like operator than simple recurrent matrix multiplication. But the binary branching of Merge, just to choose another example here to talk about the form-meaning relation, one principle, binary branching in Merge, is an interesting question, and generative grammar has always been open to different origins and locations of this apparent constraint on syntactic computation, like where does it come from? Maybe it's a condition on Merge, maybe it's imposed by the sensorimotor system, maybe it's a kind of prior, who knows? And in fact, some more recent work in generative grammar has tried to do away with all of the set-theoretic assumptions of Merge, right? Maybe set theory isn't the best way to model generative grammar. Maybe more logical accounts are more appropriate. There are lots of other recent ideas there, which are all compatible with Chomsky's approach, right? In fact, one of the things that Chomsky likes the most is when he's proven wrong, right? A lot of these theories go against the core mainstream minimalist architecture. That's true. Yeah, I think so. It's a very diverse, vibrant field. People like Adger, Hornstein, Pietroski, Hagit Borer, they disagree in fundamental ways with a lot of what the mainstream of generative grammar would say, but there's still scope for disagreement while remaining compatible with certain core assumptions, right? So a lot of David Adger's work, for example, deviates in this core respect, but it's still trying to ground these intuitions in different formal systems. I want to get your thoughts again on, I mentioned Mitchell, right? So Mitchell and Bowers 2020, they have this paper, "Priorless recurrent networks learn curiously", that I think you might be aware of, right?
So this is a really good example just to kind of get to the heart of the issue. So recurrent neural networks have been shown to accurately model noun-verb number agreement, but Mitchell and Bowers show that these networks will also learn number agreement with unnatural sentence structures, so structures that are not found in natural language and which humans have a hard time processing, right? So the mode of learning, at least for RNNs, is qualitatively distinct from infant Homo sapiens, right? So the story is, Mitchell and Bowers show that while their LSTM model has a good representation of singular versus plural for individual sentences, there's no generalization going on, right? It can represent it at the individual level, but the model doesn't have a representation of number as an abstraction, of what number is, only concrete instances of singular versus plural. So successfully predicting language behavior via LLMs, or successfully predicting neural responses in a similar way, is obviously great, and maybe we can get into that issue later, but that's only one side of the coin here, right? The other side of the coin is explaining why this type of behavior and not some other behavior, why this structure and not some other, and that's maybe Chomsky's most important point really: why this and not some other system. So linguistic theory gives you that other side of the coin, right? Whereas LLMs really don't. So the Mitchell and Bowers paper does something that- It does it. Well, yeah, so like, take Yair Lakretz and Stanislas Dehaene's work from 2019, right? They looked at number agreement in an LSTM and found two specialized units that encoded number agreement, but their overall contribution to performance was low.
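[Editor's note: the agreement studies discussed here (Mitchell and Bowers, Lakretz et al.) share one evaluation logic, which can be sketched as below. The `lm_prob` function and its probability table are hypothetical stand-ins for a real language model, invented for illustration: a model "passes" an item if it assigns higher probability to the grammatical verb form than to the ungrammatical one.]

```python
def prefers_grammatical(lm_prob, prefix, grammatical, ungrammatical):
    """The model passes if the grammatical continuation is more likely."""
    return lm_prob(prefix, grammatical) > lm_prob(prefix, ungrammatical)

# Invented next-word probabilities for one long-distance agreement item,
# with the singular attractor "cabinet" intervening before the verb.
table = {
    ("the keys to the cabinet", "are"): 0.6,
    ("the keys to the cabinet", "is"): 0.4,
}

def lm_prob(prefix, word):
    # Stand-in for querying a real LM's next-word distribution.
    return table.get((prefix, word), 0.0)

print(prefers_grammatical(lm_prob, "the keys to the cabinet", "are", "is"))  # True
```

The "unnatural structures" point in the transcript is about what happens when the same harness is run over rules no human language has: the networks pass those items just as readily.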
And then in 2021, Lakretz et al. have this paper where they show that a neural language model did not achieve genuine recursive processing of nested long-range agreement, gender marking in Italian, I think, even if some hierarchical processing was achieved, as I've argued before, right? Some hierarchy was there. But the question is, is it the right mapping? Is it the right kind of hierarchy? They found that LSTM-based models could learn subject-verb agreement over short spans, one degree of embedding, but they failed at longer dependencies. And in their most recent paper, Lakretz et al., with Dehaene, evaluated modern transformer LMs, including GPT-2 XL, on the same task. And the transformers performed more similarly to humans than LSTMs did and performed above chance overall, but they still performed below chance in one key condition, which is, as I mentioned, the multiple embedding one, the difficult structures. So the reason why I mention these studies is, you know, not just to explore the limits of LLMs, which is an interesting question, but consider work by people like Neil Smith at UCL, right? He did work in the nineties comparing a polyglot savant with neurotypical controls. He investigated second-language learning of an artificial language containing both natural and unnatural grammatical structures, like the Mitchell and Bowers paper, right? The whole framework is natural versus unnatural. And they found that while both Christopher, the savant, and the controls could master the linguistically natural aspects, only the controls could eventually handle the structure-dependent unnatural phenomena, and neither of them could master the structure-independent aspects. So some weird rules where it's like, you know, you mark the emphasis on the third word of the sentence, things like that.
So they argue that Christopher's abilities are entirely due to his intact linguistic faculties, while the controls could employ more domain-general cognitive resources like, you know, attention control, et cetera, which is why they could deal with those difficult processes. But as I just mentioned, the LSTM in the Mitchell and Bowers paper approaches natural and unnatural structures in pretty much the same way. So it's not, you know, it's not a psychologically plausible model, I would argue, for whatever humans are doing. And similar observations apply to the limits of transformer models in Lakretz's work. And all of these themes are, like, right up there, staying with us all the way to the present. So one of Tal Linzen's recent papers that I posted a few weeks ago, looking at child-directed speech, showed that LSTMs and transformers limited to ecologically plausible amounts of data generalized, as I mentioned, to the linear rules for English, right, rather than the abstract rules. And in fact, more recent work from Linzen's lab, last year I should say, looking at garden paths, shows that surprisal does not explain syntactic disambiguation difficulty, right? Surprisal underpredicts the size of the garden-path effect across all constructions. And this gets to the issue that you mentioned before: maybe surprisal is related to some aspects of syntax, but maybe not to others. It's a very non-trivial issue that is very much open to discussion; it hasn't been settled yet. So Linzen showed that garden-path effects are just way more difficult than you would expect from mere unpredictability. So another way of phrasing this argument is to quote a recent argument of Chomsky's to get at this natural versus unnatural issue.
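[Editor's note: "surprisal" in the Linzen garden-path work discussed above is just the negative log probability of a word in context. A minimal sketch, with invented probabilities; the finding mentioned is that human reading difficulty at a garden-path disambiguation exceeds what even a large surprisal value would predict.]

```python
import math

def surprisal(p):
    """Surprisal in bits of a word whose in-context probability is p."""
    return -math.log2(p)

# Hypothetical in-context probabilities, e.g. for the disambiguating
# verb in "The horse raced past the barn fell."
print(surprisal(0.5))    # 1.0 bit: a fairly expected word
print(surprisal(0.001))  # ~9.97 bits: a strongly garden-pathed word
```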
He says, suppose we have an expanded periodic table that includes all the elements that do exist, all the elements that can possibly exist, and all the elements that cannot possibly exist. And let's say you have some model, some artificial model, that fails to distinguish between these three categories. Whatever this model is doing, it's not helping us understand chemistry, right? It's doing something else. It's doing something for sure, but whether or not it's helping us understand chemistry is a separate matter. And I know that you've said in response to some of these studies, somewhere in your paper I think you say, in order to show that something is impossible with normal bounds on false positives, you'd need to look at something like 500 independently sampled languages. You cite this in your paper, right? Which you probably can't do; that's just not a feasible thing to do. So, you know, I'm not too sure that this really refutes the principled argument that I'm making here, right? Because people like Mitchell and Bowers are making an argument about impossibility in principle, not in some kind of extensional sense, you know, just searching across the world's languages to prove, for every single language, that it is impossible, right? It's a different argument whether something is impossible in some language in the Amazon compared to actually impossible based on the principles of what the language system is actually doing, like what it can do. So I would just say that, you know, all of it kind of strengthens- I think the point is that you don't actually know what is typologically not possible, right? So people like to say things like, you know, there's no language that does X, therefore we have to build that restriction into our statistical models. Right.
But if it's not statistically justified that there is no language that does X, right? If you've only looked at 20 European languages or something, then that shouldn't motivate changing the models, right? If it's not a statistically justified universal, I think. Well, you know, I think you're totally right. But that just applies more generally to the social sciences and psychological sciences, right? Like, typologically, it's very difficult to establish these things, right? So I guess you're just kind of steelmanning it a bit. You're saying that the strong claim is very difficult to prove, right? Like, there isn't a language that has X. The strong claim that something is not allowed in natural language is, I think, very, very difficult to prove. And, you know, I think that there have been lots of strong attempts. There have been lots of strong claims, often from generative syntax, right? About what all languages do. And I think that people have been very good at finding counter-examples to a lot of those things. I cite this paper by Evans and Levinson, and actually, you know, I had heard for years about how no language does X and that's what we're using to construct our theories. And that Evans and Levinson paper really kind of changed my mind about this, right? Language is actually much more diverse than what I think most syntacticians try to construct theories for. So, you know, going back to the beginning of what you said, I think we'd agree that you need language architectures which learn the things that kids learn, and learn them from the data that kids get. And those architectures might be unlikely to be things like LSTMs or, you know, simple recurrent networks or whatever, right? Like, I think all of that work is very useful in kind of honing in on the right architecture. So I'm just trying to remember all of the points you were making, oh yeah.
So, but I think there's a kind of flip side to this, which is that I think the space of things people can learn is actually kind of underestimated, right? Like, there's this bias to say, you know, people can't learn X, Y, and Z, but people, at least outside of language, have this really remarkable ability to learn different kinds of patterns, right? Like the patterns you find in music or mathematics, for example. We can learn sophisticated types of algorithms, right? We can learn to fly a space shuttle or to tie knots for rock climbing or whatever, right? Like, there's all kinds of procedural and algorithmic knowledge, which is structural, that people are able to acquire. And I think that that notion very rightly motivates looking for learning systems which can work over pretty unrestricted spaces, right? So, you know, you might say that, okay, well, language is different because language is a restricted space. And it might be true that language is restricted, but it also might be true that the things we see in language come from other sources, right? It could be that language is especially pragmatic, for example, compared to music or mathematics, right? And those kinds of pragmatic constraints are the things that constrain the form of language, right? Or language is communicative. It's probably more communicative than music, for example. And that might constrain the form of things. So, I mean, as you know, this is a very old debate in linguistics about where the properties of natural language come from. And I guess what I'm trying to say is that there's one kind of perspective where you look at all of the things humans can do even outside of language, all of the rich structures and algorithms and processes we're able to learn about and internalize, and you say, okay, maybe language is like that.
And then, yes, language also has some of these other funny little properties, but maybe those come from some other piece of where language comes from, right? You know, we have pretty sophisticated pragmatic reasoning. We're using it to achieve certain communicative ends. You can find all kinds of communicative features within the language system itself. And so maybe some of these other properties are properties that have some other origin. And that view I think could be wrong, but it's one that I think needs to be looked at to see if it's wrong, right? Like, I think it's been kind of dismissed by large chunks of linguists, right? You know, I've heard people say stuff like, oh, well, communication doesn't really explain anything about language, right? And what they mean often is that it doesn't explain, like, the particular island constraints or something that they're working on, right? There are all kinds of other things in language that communicative pressures probably do explain. So I guess my pitch is always for breadth in consideration of the forces that can shape language, and not needing to put it all into some form of innate constraints or something like that. No, totally. And I think a lot of that stuff is compatible with the minimalist program, because the minimalist program wants syntax to be minimal. It doesn't want it to be complicated. It doesn't want it to be any more complicated than it has to be. So you mentioned the curious properties, right? So there are some properties that need to be accounted for in any model of language. I'll give you one example, right? The setting of person features. And these person features exhibit very non-trivial generalizations that do not seem to be accounted for via domain-general learning mechanisms. So I'm citing here the work of Daniel Harbour at Queen Mary.
So for example, the morphological composition of person, its interaction with number, its connection to space, properties of its semantics and its linearization. They all appear to be strong candidates for our knowledge of language, right? What we mean by knowledge of language. But on the other hand, we have things like case and agreement and head movement. And these are all structural phenomena. However, they seem to resist a purely meaning-based explanation in theoretical linguistics, right? It would be great if syntax were nothing but a computational engine that builds structured meaning. And that's the goal of the minimalist program. But that's not what we actually find. That's not in any concrete minimalist theory. The program is just: language is perfect. Okay, that's the program. Is that what we find? No, obviously not. Okay, no linguist actually believes that. So it would be great if syntax were like that. But I think the program is to look for perfection but not always find it. So case and agreement and head movement are morphological phenomena. They're properties of what's called the performance systems. And so the minimalist program itself is really compatible with a lot of what you're saying about language; there are aspects of language that can be perfected and optimized for communicative efficiency. Absolutely, totally, no doubt about it. But where is that locus of efficiency? Is it in the syntax itself, or is it some kind of extra-linguistic system? Is it in pragmatics? Is it in the sensorimotor systems? Is it in the speech? Probably the speech and phonology. Probably, I mean, who knows?
But I think all of these things demand much more serious consideration of old-fashioned notions like structure dependence, compositionality and what have you, things like that which you can maybe find somewhere in the literature, but even just basic topics like quantifier raising, extended projections, adverbial hierarchies. All of these things in the minimalist program can be extra-linguistic, right? They can actually be outside of syntax and turn on very peculiar properties of the semantic-conceptual systems, which are in themselves kind of domain-general, weird leftovers from ancient primate cognition, right? The features of the way we parse events, the way we parse agents and patients, things like that. That's definitely not human-specific. But the way that syntax provides instructions to these systems probably is. So generative linguists have different theories of language production too. I'll just talk about language production, based on whether we store lemmas or whether we build words in the exact same way we build phrases and sentences. So I know that you make a distinction between construction grammar and generative grammar and the weight they place on memorizing constructions versus just building things from the bottom up, from the ground up, right? So in some generative-inspired models, the mechanisms which generate syntactic structure make no distinction between processes that apply above or below the word level. There's no point at which meaning, syntax and form are all stored together as single atomic representations. Each stage in lexical access is a transition between different kinds of data structures, right? There's meaning, there's form and there's syntax. These three features come together, and they don't always overlap. Different languages realize them in different ways.
And so a word, the basic definition of a word, is just this weird multi-system definition where lots of different cognitive systems enrich the basis of every lexical item that you have. There's nothing really like this enrichment process anywhere else in linguistic theory, right? Or at least in what LLMs are doing. So I guess I would ask you, what is your definition of a word, right? And what can LLMs really provide insights into wordhood? Because if you don't have a definition of what a word is, then you're really in trouble, right? Like, can we at least use LLMs or artificial systems to inform what we mean by a word? Or maybe we don't need that anymore. I'm not sure, what do you think? I'm not sure what you mean. I mean, I don't have a... What is a word? Why does that matter? I mean, that's just a convention about how we use the term word, right? You could use, you know, LLMs or word forms or whatever; that just feels like a conventional choice. I'm not sure what's at stake there. So I guess I would say, I agree, "word" is a conventionalization. You know, our intuitive concept of word is often biased by orthography, the way we put spaces between things, right? So I agree with that criticism; you know, word in the intuitive sense is not really a scientific construct. However, let me rephrase my question. How would you decompose the intuitive concept of word into something that is more, you know, scientifically amenable or psychologically plausible? Which is exactly what generative grammar tries to do by decomposing words into, you know, distinctive features, morphological categories, conceptual roots being merged with categorial features. You know, you get a concept and you merge it with a noun or a verb category to get a noun or a verb. These different models make different predictions, right? Yeah, I mean, I think that general idea is likely to be right for large language models.
Like, I think they kind of must have things that are like part-of-speech categories, for example. And I think that they must be able to update those categories based on the language that they've seen so far, right? So like, you know, GPT puts nouns and verbs in the right places. And to do that, you kind of need some representation of nouns versus verbs, and you need some ability to locate yourself in a string of other words and figure out if there's likely to be a noun or a verb next. So I think that on that level, those kinds of properties of words are very likely to be right, and they're also things which are very likely to be found in the internal representations of these models. I don't see how it could be any other way. But as far as I know, that's not where the main debates or disagreements are, right? Like, I think all theories of language have to say that there are different kinds of words that can show up in different places or something like that, yeah. Okay, so how about this issue? You mentioned communication, right? And you're totally right. When Chomsky says things like language is a thought system, or language didn't evolve, he's kind of being a little bit cheeky. He doesn't really mean that. He means it in a very specific sense, right? When we say language is a thought system, what we mean is we're trying to get at an architectural claim. So if you look at the architecture of the minimalist program, the syntactic derivation and the conceptual systems are literally different systems, right? The conceptual systems take stuff from syntax and then do their own business with it, and the C-I systems have their own peculiar rules and principles, which is why thought and language are both similar symbolic compositional systems, but in different ways. Only a subset of thought is properly called the C-I interface system.
Since the C-I systems are by definition whatever conceptual systems you and I have that can access and read out instructions from syntax. And we don't know what they are fully. They seem to have something to do with events and grammatical reference and definiteness. Those seem to be the main categories that language cares about conceptually, but we don't really know. That's kind of just a hypothesis, right? But what we do know is that they don't seem to make use of color all that much; no language morphologically marks shades of color. Or the conceptual features like worry or concern: no language morphologically marks the degree of worry or concern about an issue. But we do make use of epistemological notions like evidentiality and things like that. So I guess what I'm saying is the minimalist program does a good job of trying to figure out which aspects of thought language is intimately tied to and which aspects of thought it's not tied to. So the minimalist program allows us to carve that up quite neatly. And this is a much more nuanced framework than when Chomsky says language is thought. Again, maybe he means it, maybe he doesn't, but that's not what the actual architecture of his theory says. It's a rhetorical device that is very useful and interesting for attracting undergraduate audiences. But if you look at the actual theories that are coming out of the minimalist program, no one really believes language equals thought, right? The language system tries its best to access and reformat and manipulate various conceptual systems, but it has its limits, right? We know which of Spelke's core knowledge systems are hooked up to the syntax engine and which ones are not. So this kind of gets back to the idea that lexicalization of a concept seems to maybe alter it in some way. It kind of imbues it with elements that are not there in the concept itself.
So if you lexicalize a concept, you suddenly transform it a little bit, you give it a little extra, you sprinkle something else on top of it, and that seems to vary across different noun types. So these are all very clear architectural claims within generative grammar that make very clear empirical predictions. So in other words, I guess what I'm saying is: all these neuropsychology studies that are often cited, you know, in a lot of work in this vein, what do they really show? I think they show that when language is damaged in the brain, it loses its particular sway or mode of influencing those systems. But there's no real prediction from within the generative grammar enterprise that those non-linguistic systems should be impaired or should suddenly shut down if the core language system is compromised, right? In fact, if anything, that just emphasizes the principled divorce between the syntactic system and non-linguistic systems, right? So I think a lot of predictions here from the language-and-communication literature are kind of missing the point of the architectural claims. I can just give, or Daniel, do you want to go? No, go ahead. I'll just give a little bit of background there. So there are these papers from Ev Fedorenko and Rosemary Varley examining, in part, aphasic patients, so people who have impaired linguistic abilities, basically showing that with impaired linguistic abilities, you can still have preserved reasoning abilities. So people like chess grandmasters, for example, who are obviously very good at reasoning, might not have intact linguistic abilities. And then complementing that patient work, there's also work from Ev's lab showing that the parts of the brain that care about language are separable from the parts of the brain that care about other domains, even ones that would seem kind of language-like. So things like music and mathematics tend not to happen in the language areas.
So Ev and others have argued that this is basically evidence against the Chomskyan claim that language is the medium for thinking, right? Because there's thinking that can happen in the absence of language, and the brain areas that care about language seem not to be the brain areas that care about thinking. I guess, Elliot, you're saying that people don't really believe that distinction? No, and also there's a lot of self-contradiction even within these arguments, right? So in your paper, you sometimes say that Chomsky thinks that language is a thought system, but then a few pages later, you'll say Chomsky also believes that syntax is some totally separate system from anything else, right? Autonomy of syntax, et cetera. So which is it? That's not my contradiction, I mean, he said both of those things. Right, exactly. So therefore you may want to ask yourself: does he really believe these things, or what is the prediction that arises from the architecture, right? So just saying "language is a thought system," what does that mean? That doesn't mean anything. It's just a very vague statement. The question is how exactly is language contributing to thought, and how is it not contributing? Yeah, I mean, I think his claim is mainly evolutionary or something, right? That this is the origins of the system, which I think is sort of equally hard to square with the kind of patient and neuroimaging data. But, you know, if he doesn't think that, then he shouldn't say it. Or people will respond to what he said, I think. Well, no, because the argument is that language is a kind of thought system. It regulates some aspects of thought and it yields some aspects of thought that are clearly unique to humans, but it's not intrinsically or causally tied to it. The architecture of the system is very different from the kind of generalizations you can rhetorically evince from the architecture.
So for instance, when you cite work from aphasic patients showing no deficits in complex reasoning, as you just mentioned, playing chess and so on, we would actually expect this under a kind of, you know, non-lexicalist framework of generative syntax, as I said, meaning syntax and form, form just meaning anything that you can externalize language in. All these things are separate features and separate systems, right? The autonomy of syntax doesn't mean what a lot of people think it means; it just means that there are certain syntactic operations that are not semantic. There are certain things you can do with syntax that you can only do with syntax and can't do with semantics. So this gets back to the difference between, you know, Pietroski's theory that semantics is just conjunction, right, versus the view of a lot of syntacticians, who believe that there are certain peculiar, weird things you can do with syntax that are just syntactic. So there is a divorce even within the architectural framework, so it's not too surprising that you also find that divorce at the neuropsychological level, I would say. Well, I think I would want a prediction of the language-is-thought evolutionary idea then, right? So like, if you're saying that it doesn't predict that thought relies on language, then I think whoever likes that theory should come up with some predictions about, you know, what that theory actually means. I mean, I feel like those kinds of predictions are often really necessary for understanding the content of a theory. So sorry, Daniel, your hand's been up for a while. No, it's all good. I just kind of wanted to bring a breath in and an opportunity for anyone to ask any other questions. But wow, thank you both for the many topics we've covered. We'll have in the last minutes a kind of conclusion and next steps, but Dave, would you like to ask a question or just give a short reflection? Okay, no, there are many comments in the chat.
So I hope that both of you can read them on your own time to see what everyone added. Where do we go from here? As we roar into May 2023 and beyond, what can linguists, large language model developers and users, and cognitive scientists do? What do you each think are some of the most fruitful pathways forward? Well, I would say the most fruitful pathway forward is to really take cognitive psychology seriously. There's a lot of nice work recently trying to align things like ChatGPT with Wolfram Alpha plugins, the way that ChatGPT can interface with different kinds of modules. The way of building a legitimate kind of AGI system doesn't necessarily have to be psychologically reliant on the kinds of modules that human beings have, but I think it will benefit from it. So there have been some claims that large language models can maybe do all sorts of things, right? Everything you'd like. But I think in the long run, it's most likely gonna be the case that LLMs can do something very important and very interesting, but it's only gonna be one piece of the puzzle. So in fact, even OpenAI CEO Sam Altman said last week that what we can do with LLMs has really kind of been exhausted; we need new directions, new avenues and so on. I guess he was probably speaking to investors more than linguistics students here, but I think he's also right. LLMs can do something spectacular, but they're probably gonna form a small part of the general AGI architecture, right? If you wanna think about AGI as a potential goal here. So let me give another example here. So Anna Ivanova, who's a very good cognitive scientist, has a recent paper arguing for a kind of modular architecture for LLMs, which is a very nice framework, right? It's very cognitively plausible. It's exactly the kind of thing that we should be pushing for. It's compatible with Howard Gardner's notion of multiple intelligences and so on.
But I think at the same time, just to finish this comment: there was a TED Talk last week, I think, or maybe a few days ago, where a lot of this stuff can be conflated with AI hype in an unproductive way. So Greg Brockman from OpenAI gave one of these big TED Talks, where he showed different plugins that ChatGPT can use. I mentioned Wolfram Alpha, right? But there are also things like image generation and Instacart shopping, where you can get ChatGPT to buy you things and what have you. And again, this takes you back to the idea that multiple subsystems can do different sub-functions. So Brockman also showed an example of giving ChatGPT an Excel file, a CSV file, from an archive database of academic papers, where it just listed a bunch of papers and then titles and what have you, right? And he said that ChatGPT uses world knowledge to infer what the titles of the columns mean. So it understood that "title" means the title of the paper. It understood that "authors" means the number of authors per paper. It understood that "created" means the date the paper was submitted, right? And because it's a TED Talk, the audience gave a standing ovation, right? But the ability to describe labels on an Excel file is, I guess, nice. But I'm not sure you'd really call it world knowledge. So I guess I would just say there's a lot of progress that needs to be made alongside reducing anthropomorphism. You have to have the right balance of it. So like I said, you have to have a psychologically plausible kind of modular architecture, but you can't have too much anthropomorphism, because then you'll get carried away. We have to find the right balance between modeling kind of human-like modular systems but not doing it to a degree that is implausible or scientifically unhelpful. I mean, I think I agree with all of that.
I'm really excited about these ways of connecting language models to other forms of information processing, which does seem like what people have. I think I've been very surprised at the things they're able to do just as language models, right? So the different kinds of reasoning puzzles and things that they can solve, I think, are really fascinating and maybe will require us to rethink the relationships between language and thought, and to try to figure out a way of being specific about what it means for something to have a representation or to reason over that representation. But ultimately, I think I agree that people have different modes of thinking about things, and that seems important for intelligence. I'm also super excited about the BabyLM challenge on the kind of linguistic side, right? That's exactly the right thing: seeing how far we can get with smaller datasets, and maybe eventually, after that, trying to understand some more about the kinds of semantics that kids acquire and where they get it from, and how kind of external semantics can inform language learning, or specifically maybe grammar and syntax learning. I guess my other path-forward point would be that these kinds of models have really gone far beyond people's expectations for this class of model: kind of ground-up statistical learning, discovering patterns in text, seems to give really pretty remarkable results. And that, for me, going forward, has just introduced a huge wave of uncertainty over theories. So I think that our theories of basically everything in language for sure, but cognition, probably neuroscience, all of those things, I think, are going to be reworked when we really come to understand the ability of really general kinds of learning systems like these. So that makes it, on the one hand, kind of a bummer for past theories, right? Especially theories which relied on learning not being able to work well.
But on the upside, I think it makes it a very exciting time both for AI and for cognitive science and linguistics, where now there are these really, really powerful tools that seem like a qualitatively different-sized step towards human abilities. And I think integrating them, and taking both the engineering lessons and the kind of philosophical lessons about how they're made and what kinds of principles go into designing intelligent systems, I think that those things will really shape the field over the next five or ten years. And also, I would just say in the context of broader themes here, right? You're totally right. I remember reading about when Deep Blue beat Kasparov, was it? The chess thing, right? And there were some commentators who said chess is over. If an AI can beat a human, then it's game over. What's the point in studying chess? There's no point in playing anymore. And I guess if AI has achieved seemingly everything that humans need to do to play chess, what's the point of playing it? But I think, if anything, it turned out to increase the popularity of chess, right? There are now many chess celebrities, there are worldwide tournaments. And I would predict that the same is probably gonna happen with language too. LLMs do not mean it's the end of language, no more language, no more linguistics. I would actually push back and say maybe it will be the opposite. The success of LLMs will increase general interest in linguistic theory due to their apparent weird constraints and apparent limitations, right? Because I would also say that at this point, unlike the chess case, scale is definitely far from all that's needed. What is lacking is an ability of LLMs to really abstract their knowledge and experiences in order to make robust predictions and generalizations and so on. I gave some examples, but there are some others in the literature where they don't seem to really be good at generalizing.
They can kind of mimic particular token types. But I guess my final claim would be that the language acquisition literature doesn't necessarily need LLMs. Cognitive scientists don't really need LLMs. We could potentially, and me and Steven obviously disagree here, but I would say big tech companies profiting off LLMs need LLMs, right? They're the only ones that really do. It may be that the mind is, I would say, a very diverse space. It may be that there are certain forms of behavior and learning that might be captured by processes similar to what LLMs are doing. So Steven has given some interesting examples in his papers, about magnetism and so on, of rules of learning that are very domain-general and very quick and very mysterious. So maybe for those sorts of things, that kind of learning will be relevant, but I still think it's unlikely that one of the candidates will be natural language, at least the way natural language works in its full glory in terms of form, meaning, regulation and what have you. So I guess it kind of reminds me of, I saw John Wick: Chapter 4 recently, right? And there's this scene where he's walking in the desert and he's not sure if he's seen this guy that he wants to assassinate. It's kind of like when you walk in the desert and you have an illusion of seeing an oasis, because it turns out you're hallucinating. But then you realize, sometimes before it's too late, that you actually are hallucinating. You're not seeing an oasis, you're still in the desert. And I think that's kind of maybe the situation we're in right now with the linguistic competence of language models. We have the illusion of linguistic competence, but you always see the illusion before you find the oasis, right?
So I think right now we're in the hallucinating stage of the desert, where we're seeing potential sparks of linguistic competence, but it's still not very clear and not robust. And we haven't actually reached the oasis yet. Just a rapid-fire question, so see if you can give a short response. So Sphonogeno writes: is it correct to say that large language models have no priors? Do large language models have priors? I'd say yes, they definitely do. And I think the difference to how people are used to thinking about priors is that in Bayesian inference, for example, if you write down a Bayesian statistical model, you say: here are the parameters and here's what the priors are on the parameters. In large language models, and maybe neural nets in general, I think the priors are much more implicit, right? So there are some functions which they find easier to learn than other functions. And there's even some work trying to discover some statement of what those kinds of implicit priors are. But that's actually how I think about the comparison of different neural network architectures, right? Which is maybe something Elliot and I might agree on, right? Like, you have to find priors which allow them to learn the things that kids learn, right? And not all architectures will do that. Even among architectures which are Turing complete, or capable of learning any kind of function, not all of them will do it even on kind of huge dataset sizes. So I think of this sort of search over neural net architectures as really a search over priors. But it's not priors, or, I mean, you could think of it as a search over universal grammar or something, right? But it's not priors or universal grammar in the sense that people have talked about it, as an explicit statement about what kinds of rules are allowed, or an explicit statement about what kinds of functions have high probability, or something like that. It's all implicitly coded there. Yeah, totally.
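The contrast just drawn, between explicit Bayesian priors and the implicit priors baked into a learner's function class, can be illustrated with a deliberately tiny sketch. This is hypothetical and not tied to any real architecture: two learners fit the same training data perfectly, yet their built-in assumptions give very different answers outside the training range.

```python
# Two learners, same data y = 2x on x = 0..4; both fit it exactly,
# but each carries a different implicit prior that only shows up
# when we query an input outside the training range.
train = [(x, 2 * x) for x in range(5)]

# Learner A: least-squares line (implicit prior: the target is linear).
n = len(train)
mx = sum(x for x, _ in train) / n
my = sum(y for _, y in train) / n
slope = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x, _ in train)
intercept = my - slope * mx

def linear(x):
    return slope * x + intercept

# Learner B: nearest-neighbour lookup (implicit prior: reuse the closest memorised example).
def nearest(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

# Identical behaviour in-distribution...
print(all(abs(linear(x) - y) < 1e-9 and nearest(x) == y for x, y in train))  # → True

# ...but the implicit priors decide what happens out-of-distribution.
print(linear(10), nearest(10))  # → 20.0 8
```

Neither learner has a prior written down anywhere; the "prior" is just a consequence of the function class, which is the sense in which searching over neural net architectures can be read as a search over priors.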
I think that's right. I mean, the real question is reducing the space of what those priors are like, and whether it's anything remotely like what human beings are doing. So for LLMs, I would at least say that things like GPT-3 are an existence proof that building fully functioning syntactic categories from surface distributional analysis alone is possible. That, yes, that is correct. But, you know, even so, I would say most syntacticians don't really believe that syntactic categories are innate, so the prior issue is slightly less relevant here. It's the operations that are said to be innate. So in the syntax domain, it's particular linguistic computations that are said to be innate, not the categories themselves. In fact, even Charles Yang has admitted in the last couple of years that they are maybe innate, but maybe not. So people have given relevant examples; there are things like, you know, what me and Gary Marcus have talked about with compositionality. That seems to be a big problem. So people have given ChatGPT BBC news articles, asking it to compress them and then re-explain them. So one example I saw was "Peter Smith, 58, is being arrested on charges of manslaughter," and you get it to compress it and re-explain it, and it comes out as "58 people are being charged with manslaughter." All right, that's a pretty clear example of a lack of compositionality being built into whatever compression it's doing. And there are some examples of potential analogical reasoning. So Bing has this chat function, and the question is: is it just finding meta-relations that have already been documented by humans, or is it genuinely creating new relations from the new stuff that it's being fed? So, you know, someone asked it to draw up a table comparing Jesus Christ with the Nokia 9910, right? The cell phone, the Nokia 9910.
And it said, you know, it compared the release dates, it compared the size and the weight, it compared the CPU with Jesus' all-powerful knowledge. It compared the memory of the phone with the all-knowing nature of God, right? It also, I think, said that they were both resurrected, because the Nokia was re-released a couple of times, right? So the Nokia's been resurrected. That sounds like a great answer. What's wrong with that answer? It may be, it sounds a lot like analogical reasoning, but then it also had some quite weird ones, where, like, for the camera, it just gave a description of Jesus, but that's not really what a camera is. So there are some things that look like analogical reasoning, maybe, but it's unclear, yeah. I think that sounds like an awesome answer to me. I agree. I was gonna say, you said large language models are an existence proof of learning part-of-speech categories, but they don't just output part-of-speech categories, right? They have a lot of grammatical, syntactic knowledge, and moreover, they have a lot of semantic knowledge and probably some pragmatic knowledge, and, you know, they're not bad at translation, and it's way more that they have discovered than just part-of-speech categories. Well, sorry, I said syntactic categories. Right. Well, sorry, so yeah, but they've discovered way more than that. Yeah. I'm going to, as a teaser slash motivator for hopefully both of you to join again in the future, with or without other guests, read a few of the exciting questions, just for us to include in this transcript, and then thank you both, Elliot and Steven, for joining. So just a few of the last questions that were asked. Juan asked: how do small transformers (Jiang et al. 2020) compare with children learning language? 96 asked: what are your thoughts on implicit priors versus animal instinct?
Rojda asked: what constrains that space in LLMs? Don't they get there by training? So are they discovering something that maybe wasn't implemented at the start? And there are many more questions. So I hope that we can all review and reread each other's works and come together for 41.2 at some future time. Thank you, Elliot and Steven, for this excellent stream. Thank you, Dave. Thank you both. Yeah, thank you so much. Farewell. Bye. See you.