Great, thank you all for having me here. Can everybody in the back hear well? Great. All right, so I'm going to open by acknowledging that the work I'm going to talk about is broadly collaborative. I'm going to try to cover several different vignettes illustrating the challenges of understanding the incredibly rich communicative and expressive system we have in human language, the state of the art in computational modeling, and how we can address scientific questions. These are collaborators and lab members; the people shaded are those whose work I'm going to talk about today. But I want to emphasize that it's many people working together and cooperating that makes this possible. So the research that I do is focused on fundamental questions in how we can achieve a scientific and technological understanding and mastery of human language. This raises some of the deepest scientific questions that can be asked about the human mind. How can we communicate so well with language? Every day, you hear hundreds, if not thousands, of sentences that you've never heard before in your lifetime, and you understand most of them, perhaps all of them, very well, and you yourself create hundreds or thousands more in your speaking and writing. This is despite all the ways that language understanding could go wrong. For example, language is ambiguous. This is going to be a participatory talk, so I'm going to ask you a lot of questions and ask for responses. Take the sentence: the woman discussed the dogs on the beach. Who's on the beach? Raise your hand if you think the dogs are on the beach and the woman is discussing them. And raise your hand if you think the woman is on the beach discussing, but the dogs aren't necessarily on the beach. So it's ambiguous, and that's a typical split in the interpretation.
We have mathematical tools that allow us to describe, structurally, the differences between those two readings. But actually, that's the tip of the iceberg. Although this is a sentence that I crafted to make the ambiguity obvious, most sentences that you encounter in daily life are ambiguous in all sorts of ways that you don't even notice. Of course, we also understand language in the context of noise. This is a relatively clean, acoustically simple environment, but even at cocktail parties you're still able to understand language in real time. There are memory limitations: language can tax your memory. And furthermore, we don't know everything about the people we're talking to. In fact, this is a prerequisite for communication, because if you knew everything about me and I knew everything about you, you could predict what I was going to say. I wouldn't need to bother saying it; we could all go home. But that's not the case. So incomplete knowledge is a prerequisite for communication being meaningful. But of course, if we knew nothing about each other, we couldn't communicate; we wouldn't have shared conventions that we could use to transmit meanings to each other. So how do humans do this? How do humans surmount all these challenges and actually use language to communicate extraordinarily effectively, coordinate our actions, and ultimately build complex organizations and societies and advance scientific knowledge, which is really what this entire event is all about? And furthermore, of course, how can we get machines to do the same? That's the grand technological goal. The field of computational psycholinguistics, which I work in, tries to bring together complementary knowledge, expertise, and approaches to advance both our scientific understanding of these questions and the state of the art in technology.
Dating back to the mid-20th century and the dawn of cognitive science, we have deep theories of linguistic knowledge: mathematical formalisms that allow us to describe the discrete structures underlying the relationships between words in a sentence, meaning representations, and more. We have contemporary computational models that allow us to deal with reasoning under uncertainty and distributional similarity. We have language data sets. This is a golden era of data; for example, we now have language data sets that are orders of magnitude larger than the entire lifetime language experience of an individual. And we have psychological methods for experimentation. For example, we can look at the real-time unfolding of cognitive state through behavioral measures during reading, and many more. So that's the landscape, and our work lies at the intersection of all these things, combining AI, psychology, linguistics, cognitive science, modeling tools, and data sets. I want to give you a whirlwind tour of a few important features of the landscape for language, and hopefully this will provide insight into what some of the real challenges are for advancing our scientific understanding. I think the current state of AI actually puts us in a very good position to make progress on some of the deepest scientific questions that face us as a scientific community, and at the same time, answers to those scientific questions will help advance technology. So let me give you a brief picture of what has been going on for the last five minutes as I've been talking, and what has been going on in your minds without you really even thinking about it much consciously. In the general case, when we understand what we hear or read, it unfolds in real time. That's a crucial feature of human activity in the world in general, and specifically of language.
So at any particular moment, you've heard some amount of input; I'm typically mid-sentence while I'm speaking, because most words aren't the beginning or end of a sentence. Of course, this happens in a broader context outside the sentence: there's a larger stretch of language that I've already produced, and you also know a lot about the environment. At any particular moment, you get some input. If you're listening to me, you're getting acoustic input; if you're reading, you're getting visual input. That context modulates how you process the current input. It might influence whether you hear a particular word as, say, bell or pell, depending on the context in which it occurs. All of that is modulated. The input is recognized as, for example, a word; it's integrated with the context; and that goes on as a recursive process that takes us through time, building up understanding. And one interesting feature that you don't think about, but that I'll reveal to you, is that expectations about how that input will continue, and what it will be like, are ubiquitously formed. I'm going to give you a couple of simple demonstrations of this by giving you sentences that have started but not completed. You'll find that you have strong intuitions about how they will complete, and I'll be able to guess them. For example, if a sentence starts, "Jamie was clearly intimidated...", what do you think is likely to happen next? Maybe a "by" phrase: intimidated by somebody or something. Does that sound right? Raise your hand if that seems right. Excellent, yeah. And the reason for this is a confluence of factors. First of all, there's the meaning of the word intimidated, but not just that; it's also the syntax, the relationship between words and the rest of their environment.
Typically, if I just use intimidated on its own, the thing that comes next is the thing being intimidated, not the thing doing the intimidating. But because this is a passive-voice context, you actually get a very different expectation. This is effortless, and it's at a relatively abstract level: it's for a specific word, and then a kind of phrase that denotes a meaningful relationship with the event. That's one kind of expectation, but there are other kinds of factors that influence this as well. For example, if I begin a sentence, "Terry ate an...", raise your hand if any of these is the kind of thing you think is coming next. And raise your hand if those are the same things that come to mind when you read "Terry ate a...". Nobody's got their hand up. But what about these? Exactly, okay. So this is a case where there's a combination of semantic, structural, syntactic, and also phonological features: whether the word starts with a vowel or not. There's also more abstract situational semantic knowledge. For example: the children went outside to play; the squirrel stored some nuts in the statue. Right? Raise your hand if it was statue. No. Raise your hand if it was tree. Excellent. So this is an interesting feature. Now, actually, where's the most common place that squirrels store nuts? The ground, not the tree. Raise your hand if you thought ground. Fewer people than tree, okay? So this is not entirely about ground truth in the world. And let me point out that statues are pretty much as plausible as trees as places to put nuts; they have just the same kinds of nooks and crevices that trees have. But for whatever reasons, you're actually predicting tree, not statue. So this notion of expectation is not just about what's possible or what makes sense. There's additional content to it, and that is all part of what's going on in the human mind.
World knowledge, general situational knowledge, is integrated in fine detail with the structure and real-time unfolding of language. Now, let's look at the AI situation. It has turned out that this question of expectations and predictions is shaping up to be, in a way, the best way of extracting distributed knowledge from large language data sets. One provocative way of posing this: in the contemporary natural language processing landscape, word prediction is coming to emerge as potentially the quote-unquote "NLP-complete" task. That is, if you could do a great job of predicting a word in context, you could do everything else well too; you would have extracted the knowledge required to do all the other things you might want to do with language. So let me give you a whirlwind tour of how this works in state-of-the-art models. I'll take a sentence, and then I'll mask a word. This is the way that state-of-the-art models like BERT and RoBERTa work, which you've heard of if you've been following the NLP literature (you can't see them here because of the chairs). You take a sentence and you mask one of the words, and the prediction problem is that the model has to guess the identity of the word that has been masked. In this case, you probably have some guesses. It might be, for example, razor; raise your hand if that's what you thought of. Some people, great. And then a deep structured model is set up that, over many layers, has to extract generalizations so that at the top there can be a simple prediction, from a distributed representation for each individual word, of the word that has been masked. So that's the AI setting.
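To make the masked-prediction objective concrete, here is a deliberately tiny stand-in: instead of a deep network like BERT or RoBERTa, it just counts which words appear between the same neighbors in a toy corpus. The corpus and sentences are invented for illustration; only the task setup (mask a word, guess its identity from context) matches the real models.

```python
from collections import Counter

# Toy illustration of the masked-word prediction objective. Real models
# use deep networks trained on huge corpora; here we score candidates
# for the masked slot by how often they occur between the same
# neighboring words in a tiny invented "training corpus".
corpus = [
    "she shaved with a razor this morning",
    "he cut the rope with a razor",
    "they ate with a fork at dinner",
    "she wrote with a pen all day",
    "he shaved with a razor yesterday",
]

def predict_masked(sentence_with_mask, corpus):
    """Guess the [MASK] token from (left, right) neighbor counts."""
    words = sentence_with_mask.split()
    i = words.index("[MASK]")
    left, right = words[i - 1], words[i + 1]
    counts = Counter()
    for line in corpus:
        toks = line.split()
        for j in range(1, len(toks) - 1):
            if toks[j - 1] == left and toks[j + 1] == right:
                counts[toks[j]] += 1
    # fall back to the left neighbor alone if nothing matched
    if not counts:
        for line in corpus:
            toks = line.split()
            for j in range(1, len(toks)):
                if toks[j - 1] == left:
                    counts[toks[j]] += 1
    return counts.most_common(1)[0][0] if counts else None

print(predict_masked("he shaved with a [MASK] this morning", corpus))  # prints "razor"
```

The deep models replace these raw counts with learned distributed representations, which is what lets them generalize to contexts they have never literally seen.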
So now I want to go back to the cognitive science of human language processing and tell you a few things about the kinds of phenomena that we should expect to eventually emerge from a model, for human-level AI; these would be benchmarks of success. Okay, so this is about structure and surprise, and I'm going to give you this in an incremental context. I'll just give it to you word by word; I'll be silent, and see what your experience is. So at the end of the sentence, there was a surprising event. What was surprising about this? It seems like something's wrong with this sentence, perhaps. Just think for a moment about your subjective experience of it. Now, I will insist that this is actually a perfectly well-formed sentence of English. And the fact that it's surprising, despite being well-formed, reveals a lot about the cognitive operations underlying human language understanding. To demonstrate that this is a well-formed sentence, even though you may be doubtful, I want to give you a slightly different sentence. This one; can you see it at the bottom? I'm going to move these chairs out of the way. "The woman who was given the sandwich from the kitchen tripped." Does everybody agree that this is okay, this is English? Great. Now, given and brought are synonyms in this context, so I can change given to brought: "The woman who was brought the sandwich from the kitchen tripped." These mean more or less the same thing, modulo the slight differences in the meaning of bringing and giving. Does that seem right? Is everybody convinced both of these are fine? So now, there's a rule in English that allows me, in a wide variety of contexts, to take the words who was and just take them away. "Who was given the sandwich" is what's called a relative clause, and there's a rule that says a relative clause initiated by who was can have the words who was removed.
For example, I can remove the words who was from the bottom sentence and I get "The woman given the sandwich from the kitchen tripped." Is that okay with everybody too? Yeah, great, okay. So now I should be able to do the same thing, the same operation, and I get this extraordinarily confusing sentence. Of course, your subjective experience is at variance with that: you feel that something's still wrong with the sentence. The reason for that is well accounted for by hierarchical symbolic structural descriptions. This is the structure that the sentence should have, but there's a sort of phantom, really appealing structure that looks very different. In particular, on that phantom structure, the subject of the sentence is just the woman, and the rest of the sentence says that she brought the sandwich from the kitchen; the woman is doing the bringing. But if you have that interpretation, there's no way to accommodate tripped. Whereas in the right interpretation, the one that should be available, "the woman brought the sandwich from the kitchen" as a whole is the subject of the sentence (the woman who was brought the sandwich), and tripped is the thing that the woman did. So it should be equivalent to this meaning. But that's hard for humans. And we actually have cognitive theories, computational theories, of why this would be hard for a human. The reason is that we take tree-structured models with probabilities that allow us to put preferences on the likely interpretations summarized by these trees, and these unfold incrementally, because human language processing is incremental, happening from left to right. For this kind of sentence, you have two possible alternatives; you can think of the size of these trees as their relative preferences, and you wind up with an extraordinarily strong preference for one of them, and you may lose the ultimately correct interpretation altogether.
If you don't lose it, you'll be able to understand; but if you did lose it early on, you won't. So that's human language processing: highly incremental. This incrementality is also reflected in human behavior, for example, in reading. I think you all have the subjective experience every day that when you read, it's a sort of continuous scanning of the text. Does that seem right? It's an incredible illusion that the mind plays on you, because this is what the eyes are actually doing when you read. The eyes spend most of the time stationary, and then they make extraordinarily rapid jumps called saccades. So we can summarize reading behavior moment by moment: how much time are you spending on each word, and where do you go next? Fixations and saccades. This is generally progressive, but sometimes, when we get confused, we move back. And this, together with the timing, carries an extraordinary amount of information about what happens in real time. And we can ask the question: why are some words harder than others? There's a simple hypothesis, which turns out to be extremely powerful: the number of bits a word carries in the information-theoretic sense, its surprisal, is an index of how difficult that word is in its context. And we can use these behavioral measures to actually test that hypothesis.
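The number-of-bits idea has a simple formula: a word's surprisal is the negative log (base 2) of its probability in context. Here is a minimal sketch of that quantity, plus a linear reading-time prediction of the kind the hypothesis implies; the probabilities and the intercept and slope values are invented for illustration, not fitted values from the actual studies.

```python
import math

def surprisal_bits(p):
    """Surprisal of a word in context, in bits: -log2 P(word | context)."""
    return -math.log2(p)

# A highly predictable word carries few bits; an unpredictable one, many.
print(surprisal_bits(0.5))    # 1.0 bit
print(surprisal_bits(0.001))  # roughly 9.97 bits

# The hypothesis: reading time grows linearly with surprisal,
# RT(word) = a + b * surprisal(word). Coefficients here are made up.
def predicted_reading_time_ms(p, a=200.0, b=15.0):
    return a + b * surprisal_bits(p)
```

The probability P(word | context) is exactly what a language model outputs, which is why behavioral reading data can be used to test language models and vice versa.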
So what you can do is have people read naturalistic data sets and look at the relationship between the number of bits, which is a measure of log probability, and how much time people spend reading each word of the sentence, using those moment-by-moment measures we can get, for example, from eye tracking. Using either eye tracking or another commonly used method, we find that the relationship between the log probability of a word, which is something you might get out of an AI language model, and how long people spend reading it is pretty much linear. That's a wonderful scientific result, because it's not obvious that prediction would have that kind of ubiquitous impact even for very low-predictability words. But you can also imagine that this might set the stage for a whole range of cognitive-ergonomics applications: if it's important for somebody to be understanding language in real time, you want to be able to monitor whether they're actually responding, whether the words that convey a lot of information and will be surprising are ones that people are attending to more, and the really predictable ones are the ones people skip over. For the sake of time, I want to jump over a couple of other pieces of what I have prepared, but I do want to mention briefly one thing you can do with machine learning techniques: you can take human eye movements and basically create something that, I would say, is about two-thirds of the way to a replacement for the TOEFL exam. For example, we can have people just read sentences like "CNN wants to change its viewers' habits", we get people's eye movements out of that, and then we can actually make predictions from online language processing without all of the things you'd have to do to get somebody to take a regular TOEFL exam.
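One way such an eye-movements-to-proficiency comparison could work, sketched here with invented feature names and numbers (these are not the actual features or data from the work), is to featurize each reader's eye movements as a vector, average the native speakers into a prototype, and score a new reader by similarity to that prototype.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical features per reader: [mean fixation duration (ms),
# regressions per word, skip rate]. All numbers invented.
native_readers = [
    [210.0, 0.12, 0.35],
    [225.0, 0.10, 0.40],
    [215.0, 0.15, 0.38],
]
# Native-language "prototype": the average native reading profile.
prototype = [sum(col) / len(col) for col in zip(*native_readers)]

esl_reader = [310.0, 0.30, 0.10]
print(cosine(esl_reader, prototype))  # closer to 1.0 = more native-like
```

The claim in the talk is then that this similarity score correlates with scores on a standardized English test.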
We can predict English proficiency. What we do is take a featurized representation of that eye-movement pattern: we get a bunch of native speakers to read text and build a sort of native-language prototype representation of what reading behavior looks like, and then we compare that to an ESL learner and look at how close they are. And this is actually very strong at predicting performance; in this case the test is the Michigan English Test, which is like an open-source version of the TOEFL. There's a pretty good correlation, and we're continuing to refine this, between similarity to a native English speaker's eye-movement patterns and how well somebody will perform on a standardized English test. But I want to spend just a few minutes, we're close to lunchtime, using the kinds of insights we get from studying human language processing to uncover the strengths and weaknesses of contemporary AI models of language, which are of course everywhere today. So we can ask the question: what has state-of-the-art AI learned about English? And I really think the experimental and technical repertoire of linguistics and cognitive science puts us in a good position to probe, in a sense, for the soft underbelly of AI models. You've probably all had the experience of reading a text that GPT-2 automatically produces; it seems remarkably fluent, but you don't know how much is more or less copy-pasted from its massive billion- or multi-billion-word training set and how much is really human-like productivity. So, for example, we can take a relatively recent state-of-the-art LSTM model, trained on something along the lines of a human lifetime's worth of language experience, and give it carefully crafted beginnings of sentences and see what it thinks should come next.
Here are some examples that I generated myself, and I'll comment on them briefly. For example: "The girl who the newspaper..." How might this continue? Here's how the model finishes the sentence: "The girl who the newspaper now calls his girlfriend has been really hateful." That's a little weird, but it's grammatically really, really good. "The girl who the newspaper calls...": it has the right number of verbs, everything lines up; it's just a weird meaning. So I'm going to give that a check mark. "The monologue that the actor who the movie industry...": this is a deviously designed beginning that I used psycholinguistic techniques to set up. I know this is the kind of sentence that would be very hard for a human. You can think about it yourself; you might say, I don't know how to complete that sentence. Well, here's what the model does: "The monologue that the actor who the movie industry likes made silent was being uploaded." And actually, this is great; this is an incredible success. It's a weird meaning, but look: the movie industry likes the actor, the actor made the monologue silent, and the monologue is being uploaded. Totally fine. This is a great success. Of course, success is not always there. "The man who the car has gazed long in the ad for years," period: no good. "The athlete who the restaurant would decide justify to add the main West Coast restaurants into his menu and who hadn't upgraded from his previous suite into a more" unknown-word "steakhouse": I don't even know how to categorize that. So here's the situation we're in, and this is an instance of the broader questions of explainability, interpretability, and safety. We basically have aliens that are being exposed to natural language, and they're learning something from it, but we don't know what they're doing. And we want to understand what they're doing.
We want to understand what kinds of generalizations are being made. At the same time, this offers an opportunity to answer a longstanding theoretical question in cognitive science, which is the following. To a first approximation, a child learning their native language learns it from positive input alone. That is, when the child makes a mistake, the adult in practice rarely corrects the child. Child language development research also shows that even when the parent corrects the child, the child doesn't usually pay attention. So really, what the child is learning from is positive data: examples of what is okay. The generalization problem is to figure out what's possible from instances of what's okay. And so we can take these state-of-the-art models, which, if you look at them technically, don't typically have anything that looks like a human-like hierarchical inductive bias, and we can ask: what kinds of generalizations emerge from the language data alone, and what kinds of generalizations fail to emerge without the right inductive bias being baked in? I think this is a tremendous scientific opportunity. It also highlights a really important thing that I want everybody here to remember, which is that for the world's languages, AI is a small-data problem, not a big-data problem. For a handful of languages, it's a big-data problem. But there are 6,000 to 7,000 languages in the world, and many of them are endangered. Estimates are that by the end of this century, 50 to 90% of the world's languages may be lost. This is actually a more severe crisis than species extinction, in fact, by the raw numbers. And of course, there are a handful of languages that are spoken by many native speakers: 85% of the world's population speaks one of the top 100 languages in the world as a native speaker.
But that leaves hundreds of millions of people outside that 85%. And even among those top 100 languages, for most of them we don't have anything near the kinds of data sets and resources that we have for the largest languages in the world, like Mandarin, which has almost 15% of the world's speakers, or English, 5%, Spanish, 6%, and so forth. So in terms of the language diversity of the world, this is not a big-data problem. We have a moral obligation, and in fact there is economic opportunity, in bringing language technology to people all over the world, regardless of their native language. But for most of the world's languages, we're not going to be able to do that with huge data. And so in our work, which has been collaborative work with people at IBM, and it's been very exciting and fun and I've learned a lot, we take different architectures for how sentences come into being. I gave you a quick snapshot of a sequence-based architecture, which is the most common way of thinking about how sentences come about as generative models. This includes the recurrent models, but also the transformer-based models, which are sequential: words are generated from left to right by some process that has a consolidated representation of the previous state and generates the next thing. When the next thing is generated, it becomes the most recent thing, and that continues recursively. But drawing on the longstanding body of theory and mathematical analysis in linguistics, there are also tree-based models. Tree-based models say that, fundamentally, there are underlying latent phrasal structures. And to me, one of the most exciting kinds of architectures is one in which you are generating symbolic structures, and words are the visible part of that symbolic structure, with other parts latent.
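The tree-based idea can be sketched as generation by symbolic actions: open a phrase, generate a word, close a phrase. In a real model of this kind (for example, a recurrent neural network grammar), a neural controller chooses each action; in this toy the action sequence is fixed by hand, and the action names are invented for illustration.

```python
# Toy illustration of tree-structured generation: a sentence is produced
# by a sequence of symbolic actions. Words are the visible part of the
# structure; the phrase brackets are the latent part.
def run_actions(actions):
    """Replay (op, arg) actions into a bracketed tree string."""
    stack = [[]]
    for op, arg in actions:
        if op == "OPEN":          # start a new phrase labeled arg
            node = [arg]
            stack[-1].append(node)
            stack.append(node)
        elif op == "GEN":         # generate a visible word
            stack[-1].append(arg)
        elif op == "CLOSE":       # finish the current phrase
            stack.pop()
    def show(node):
        if isinstance(node, str):
            return node
        return "(" + node[0] + " " + " ".join(show(c) for c in node[1:]) + ")"
    return show(stack[0][0])

actions = [
    ("OPEN", "S"),
    ("OPEN", "NP"), ("GEN", "the"), ("GEN", "woman"), ("CLOSE", None),
    ("OPEN", "VP"), ("GEN", "tripped"), ("CLOSE", None),
    ("CLOSE", None),
]
print(run_actions(actions))  # prints (S (NP the woman) (VP tripped))
```

The point of the architecture is that the discrete stack operations build an explicit syntactic structure, while the choice of which operation to take next is learned.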
But the controller that figures out what to do next is a neural controller. The neural controller says: what kind of symbolic operation should I do next? And I think this is a kind of setup that, in practice, performs very well, and to some extent it also makes more explainable why certain next words are being predicted. And we can explore what advantages it carries. So we have some comparisons that try to reveal the relative strengths and weaknesses. We have very big-data models, and then we have two small-data models, one of which is sequential and one of which is a hierarchical, tree-based structural model. The sequential models have no explicit syntactic representations, and the recurrent neural network grammar models do, though of course we don't know exactly what generalizations they'll make from the neural controller. So I'll just give you two simple examples of this. One of the simplest things we could possibly do is look at expectations not at the level of the next word, but at the level of the next abstract structural unit. For example, is this an okay sentence? "The doctor studied the textbook." Perfectly fine, right? What about this? "As the doctor studied the textbook." That sentence feels like it falls off a cliff, doesn't it? And the reason is that the word as does what's called subordination: it says that the next clause is not the main clause of the sentence but something subordinate to the main clause, and I still have to get the main clause. So stopping there is not okay. "The doctor studied the textbook, the nurse walked into the office": that probably feels a little weird. "As the doctor studied the textbook, the nurse walked into the office": perfectly fine, okay?
So we can test any of these kinds of incremental models, neural language models, by taking a context and asking how likely the model thinks the natural, human-preferred continuation is. And we can judge this using surprisal, the number of bits I talked about, which has cognitive reality and is also a very natural metric for evaluating language models. So here's the interesting thing about this set of four sentences. The top two don't have what's called a matrix clause, a main clause, at the end; the bottom two do. The first and the third don't have the word as; the second and the fourth do. And what we should see is an interaction between those features: either you have as at the beginning and a second clause, or no as and no second clause. So if we take the difference in surprisal between these continuations, this one should be a positive difference, more surprising than the top, but this one should be a negative difference, less surprising than the top. And we can see what, for example, the big-data models do (these are trained on 890 million words). What we find is that the red lines are the comparison between the top two sentences, and the models actually don't do a great job with that. In particular, the model trained on a human lifetime's worth of data doesn't care at all about ending the sentence after "As the doctor studied the textbook." They both do a good job of using the word as as a positive cue to expect another clause after the first clause, so they both succeed at that. And then we can also look at the small-data models, and here we see something really striking.
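The 2x2 logic just described can be written out directly. The surprisal values below are invented numbers standing in for real model outputs; what matters is the sign pattern a model should show if it has learned the subordination regularity.

```python
# Conditions: the continuation is either end-of-sentence ("end") or a
# second clause ("clause"), with or without "as" at the start.
surprisal = {               # invented surprisal values (bits)
    ("no_as", "end"):    2.0,   # "The doctor studied the textbook."        fine
    ("as",    "end"):    9.0,   # "As the doctor studied the textbook."     bad
    ("no_as", "clause"): 7.0,   # "...textbook, the nurse walked in."       odd
    ("as",    "clause"): 3.0,   # "As ...textbook, the nurse walked in."    fine
}

# A model that has learned the regularity should find ending the
# sentence MORE surprising after "as" ...
end_effect = surprisal[("as", "end")] - surprisal[("no_as", "end")]
# ... and a second clause LESS surprising after "as".
clause_effect = surprisal[("as", "clause")] - surprisal[("no_as", "clause")]

print(end_effect > 0 and clause_effect < 0)  # True means the interaction pattern holds
```

A model that treats as as just another word would show no such interaction: both differences would hover around zero.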
The small-data model that is purely sequential learns almost nothing about this regularity, but the structurally supervised model actually learns the regularity most robustly; in fact, it learns it more robustly than you would get with one or two more orders of magnitude of data. And the very last thing I'll mention is that we can come back to the example I gave near the beginning of the talk, "The woman brought the sandwich from the kitchen tripped," look at all four versions of the sentence, and do exactly the same thing. Here we should see a slightly different pattern, which is the following. You'll notice that the words who was in the brought version of the sentence make it really clear what the structure is, so they should make the word tripped much less surprising at the end than it would have been without who was. But who was shouldn't matter as much for the given version, because brought is ambiguous in form: it could be the passive participle, or it could be the woman doing the bringing. Given is a form that doesn't allow the woman to be doing the giving. So for given, this difference should matter much less; the difference between these should be smaller than the difference between those. I'll just give you the small-data models here. And it turns out that, first of all, the sequence-based small-data model learns very little; well, it actually does get the generalization, but not very robustly. It's a very strong and robust generalization for the grammar-based model. In this kind of work, we've done this for a wide variety of other structures and contexts, and compared them against well-established results in the human language understanding literature. So I'll just end by saying that I've lost my conclusion slide.
But let me mention one really brief thing. One way we're scaling this up, because of the value of connecting the insights of cognitive science and linguistics with the tools of AI, is that we're trying to create an actual easy interface for these communities to interact. We're doing this through syntaxgym.org, which is currently in development. We're developing it as a clearinghouse for AI researchers to bring their models to test against carefully crafted, linguistically informed tests like these, and conversely, for people who don't have the AI background but do have the linguistics or cognitive science background to develop tests and run them against AI models. And I just want to close by saying that, once again, bringing together the models, the theory, the data, and the experimental techniques is more important, and there's more opportunity for doing it, for advancing the technology and the science, than at any time I've seen in my career in the field. And I'm really excited about it. I'd like to close by thanking you for listening. Enjoy lunch, everybody.