Now let's welcome our next presenter, Greta from MIT.

Hi everyone, and thank you so much for the opportunity to share this work today. I'm Greta, a PhD candidate in the Department of Brain and Cognitive Sciences at MIT. I study how humans extract meaning from text and speech using tools from artificial intelligence. The high-level goal of this project is to obtain a better understanding of the human language network by driving and suppressing neural responses in this network with the help of large language models. Let me try to unpack that a bit.

First of all, what is the system we are investigating? While you are listening to these words, you are engaging the language network in your brain: a set of regions in the frontal and temporal parts of the brain, in the left hemisphere in most individuals, shown here by the red demarcations on the inflated brain surface. The color shows the probability that a voxel at a given location is selective for language, meaning that part of the brain responds more strongly to linguistic input, such as naturalistic sentences, than to strings of non-words or degraded speech. These regions are sensitive to word- and sentence-level meanings, and we estimated this map across 800 participants. So most individuals have this fronto-temporal network in the brain that they use to process language.

This network is known to support comprehension and production of spoken, written, and signed languages, and it is very selective for language relative to various other inputs, meaning that other types of input, such as music or arithmetic, won't engage it. Finally, if parts of this network are damaged in adulthood, linguistic deficits result. However, even though we know that this network supports language processing, many aspects of the representations and algorithms that actually support language comprehension remain unknown. Basically: what is going on in this network when we process language? That is the question we try to tackle with the help of large language models.

Moving on to the language models: how do they come into play, and how are they relevant to this question? First of all, large language models, LLMs, are trained to predict the next word. Next, LLMs have been shown to be predictive of brain responses during language processing. Given that we now have LLMs that can predict brain responses during language processing, can we leverage their predictive power to identify new stimuli that maximally drive or suppress brain responses in the language network? That is the main question. Can we identify new sentences that drive responses up in this network? For instance, would an input like "My skin feels like melting wax" drive the responses up? Or, conversely, can we find sentences that suppress the responses in this network, lowering them as much as we possibly can?
By asking this question, we tap into two sub-questions. One: are LLMs accurate enough to non-invasively control brain responses implicated in higher-level cognition, such as those of the language network? Two: what kind of linguistic input is the language network most responsive to?

This logic of understanding a brain region or a neuron better by figuring out what it is most responsive to is, of course, not new. The idea that neurons have certain stimuli they respond to more strongly than others dates back to the pioneering work of Hubel and Wiesel, who demonstrated that some neurons respond preferentially to particular stimulus orientations. We apply the same logic here, just to the language network in the brain.

So first, let me go over the approach, followed by results and then some conclusions and perspectives. The approach, as I mentioned, is to build a predictive model of how the language network responds to any arbitrary sentence, and then to use this predictive model to identify sentences that elicit maximal or minimal activity in this network. The sentences meant to elicit maximal activity I'll call drive sentences, shown in red in this presentation, while the sentences meant to elicit minimal activity I'll call suppress sentences, shown in blue.

First, how do we build this predictive model? We have a set of humans, in this case five individuals, who are exposed to 1,000 diverse sentences while we record their brain responses with fMRI. For each participant, we obtain the response to each of these 1,000 sentences, and because we are interested in building a model that can predict the response in any arbitrary human's brain, we average the responses across the five participants within the language-selective regions. That means that for each sentence, we end up with one value capturing whether the network's response to that sentence is high or low.

On the modeling side, we have our large language model, in this case GPT2-XL, which we expose to the same 1,000 sentences while recording its internal unit activations, the representations of the network as it processes the same sentences the humans were exposed to. We can then leverage the activations from GPT to fit an encoding model that predicts the brain activation associated with each sentence. This model can take any sentence as input and output the predicted language network response, which makes it a very useful predictive model.

Now the second part of the approach: we leverage this predictive model to identify the sentences we hope will elicit strong or weak activity in the language network, the drive and suppress sentences. We do that by searching across many, many sentences, in this case roughly 1.8 million sentences extracted from various text corpora. We feed each of these sentences through GPT, obtain the representation associated with each one, and then use our predictive model to generate a predicted brain response for every single one of these nearly two million sentences.
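To make the encoding-model step concrete, here is a minimal sketch in Python of the kind of fit described above. This is a sketch under stated assumptions, not the exact pipeline from the paper: the model size (plain "gpt2" instead of GPT2-XL, to keep it light), the layer choice, the mean-pooling over tokens, and the ridge penalties are all illustrative, and the sentences and brain responses are placeholders.

```python
# Minimal sketch of the encoding-model fit: LLM sentence representations
# regressed onto the averaged language-network fMRI response.
import numpy as np
import torch
from transformers import GPT2Tokenizer, GPT2Model
from sklearn.linear_model import RidgeCV

tok = GPT2Tokenizer.from_pretrained("gpt2")  # the talk used GPT2-XL
lm = GPT2Model.from_pretrained("gpt2", output_hidden_states=True).eval()

def sentence_embedding(sentence, layer=6):
    """Mean-pool one layer's hidden states over the sentence's tokens."""
    ids = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = lm(**ids).hidden_states[layer]  # (1, n_tokens, n_units)
    return hidden.mean(dim=1).squeeze(0).numpy()

# Placeholder training data: the experiment used 1,000 diverse sentences
# and, per sentence, the z-scored BOLD response averaged across five
# participants' language-selective regions.
sentences = ["The apple is on the table.",
             "My skin feels like melting wax.",
             "They walked out onto the balcony.",
             "What else is there to do?"] * 5
bold = np.random.randn(len(sentences))  # placeholder brain responses

X = np.stack([sentence_embedding(s) for s in sentences])
encoder = RidgeCV(alphas=np.logspace(-2, 6, 9)).fit(X, bold)

# The fitted encoder maps any sentence to a predicted network response.
print(encoder.predict(sentence_embedding("My skin feels like melting wax.")[None, :]))
```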
That means that for every one of these sentences we obtain a predicted brain response: the brain activity associated with that sentence, as estimated by our predictive model. Because we are interested in sentences that either drive or suppress this network, we can then sort our predictions: at the top, the sentences with the highest predicted responses; at the bottom, the sentences with the lowest. We take the top 250 and the bottom 250 sentences, and now we want to record brain activity to these new drive and suppress sentences in new participants. So we are asking: can we use these model-selected sentences to drive or suppress responses in the brains of new participants? This can be thought of as a type of non-invasive control of brain activity. You might imagine this could be tricky, because we are trying to generalize to new individuals, and because we are operating in the domain of language, where the language network extracts abstract meaning representations, which is of course a tricky thing to model. In addition to the drive and suppress sentences, we also collect data for baseline sentences: 1,000 sentences sampled from various naturalistic corpora.

So now I'm going to show you the identified drive sentences, which our model predicted would elicit high brain responses in the language network. To point out a few examples: one is "People on Insta be like gross," another is "Turin loves me not, nor will." On the other hand, we have the suppress sentences, which could for instance be "They walked out onto the balcony" or "What else is there to do?" And finally, we have the set of naturalistic baseline sentences, which are simply sentences extracted from various naturalistic corpora.

Great, so now let me move on to the results. Here I'm showing the condition-level responses to the drive sentences in red, the suppress sentences in blue, and the baseline over here. We collected these responses from three new participants to these 1,500 sentences in total, and the y-axis shows the z-scored BOLD response. As we can see, and as predicted, our drive sentences elicit strong responses in the language network, and our suppress sentences manage to actually suppress activity in this network. We can conclude that these model-selected sentences successfully drive and suppress brain responses in the language network of new individuals.

We recorded these brain responses in an event-related fMRI design, which means that each sentence is presented as its own condition: a sentence, a small break, a sentence, a small break. To validate that these findings generalize to other experimental paradigms, we also collected responses to the drive and suppress sentences in a more traditional blocked fMRI design, and we saw, again, that these drive and suppress sentences are able to control the responses in the language network as we predicted.

So that is the condition level. Now, because we collected responses to individual sentences, we can look at sentence-level responses and ask two questions. One: how accurate was our model at predicting responses to individual sentences? Two: what are the individual sentences that elicit the highest or lowest brain responses?
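Continuing the same sketch, here is the sort-and-select step just described: score a sentence pool with the fitted encoder and keep the extremes. It reuses `sentence_embedding` and `encoder` from the sketch above, and the tiny `corpus` list stands in for the roughly 1.8 million corpus-extracted sentences.

```python
# Sketch of the search step: predict a response for every candidate
# sentence, sort, and keep the extremes as drive / suppress stimuli.
import numpy as np

corpus = ["They walked out onto the balcony.",
          "What else is there to do?",
          "People on Insta be like gross.",
          "My skin feels like melting wax."]  # placeholder pool

feats = np.stack([sentence_embedding(s) for s in corpus])
predicted = encoder.predict(feats)  # one predicted response per sentence

order = np.argsort(predicted)       # ascending predicted response
n = 2                               # 250 per condition in the experiment
suppress = [corpus[i] for i in order[:n]]   # lowest predicted responses
drive = [corpus[i] for i in order[-n:]]     # highest predicted responses
print(drive, suppress)
```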
So now I'll show you a scatterplot of the sentence-level brain responses versus the predictions from the model. On the x-axis, we have the predictive model's predictions of the brain response, which is why the red drive sentences cluster on the right: these sentences were selected to elicit as much activity as possible. At the other end of the x-axis are the blue suppress sentences, selected to elicit minimal activity in this network. The gray points are the naturalistic baseline sentences, which are widely distributed along this scale. On the y-axis, we have the averaged z-scored BOLD response from the three participants we collected data from.

First of all, how good is the performance of this model? We see that the correlation is indeed positive, and in simulations we quantify that it reaches 69% of the theoretically obtainable correlation given inter-participant variability and measurement noise. Second, we can look at which sentences actually elicit the highest activity in the language network. The highest-eliciting sentence is "I am progressive and you fall, right?" and sentences eliciting very low responses include ones like "She wore a short black dress."

Now the key question that I started out with: why do some sentences elicit higher brain responses than others? Can we learn more about this network by quantifying what is actually special about the sentences that elicit high versus low responses? To do that, we collected sentence ratings from crowdsourced participants on Mechanical Turk, which I'm showing on the x-axis here. For each sentence we collected brain data for, we asked MTurk workers to provide a rating, for instance of the grammaticality or the plausibility of the sentence; lower numbers mean less grammatical or less plausible, and higher numbers mean more grammatical or more plausible. On the y-axis, we again have the BOLD response, the brain responses we collected, and these graphs show the sentence-level responses, where each point is a sentence. The insets show the brain responses averaged within bins according to the ratings.

On these two scatter plots, we can see that sentences in the mid-range of grammaticality and plausibility elicit the highest brain responses: an inverted U-shape. That means the language network responds most strongly to sentences that are normal enough to engage it, but weird enough to tax it. In general, sentences that are unusual, with odd grammatical structure or implausible content, elicit higher brain responses due to their unexpected form and/or meaning; but if a sentence is highly ungrammatical and very implausible, so that it no longer adheres to the natural statistics of language, the system stops responding to it.

Another feature we investigated was emotional valence, and the trend we observed was that sentences with negative content elicit higher brain responses. The final feature I want to talk about here is imageability, which is how easy a given sentence is to visualize. The trend we found here was also negative, meaning that sentences that are abstract and hard to visualize elicit higher responses.
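As a rough illustration of the binned-ratings analysis behind those insets, here is a sketch that bins sentence-level responses by a one-through-seven rating. The arrays are placeholders, and the bin count is an assumption on my part.

```python
# Sketch of the ratings analysis: bin sentence-level BOLD responses by a
# behavioral rating (e.g., grammaticality, 1-7) and average within bins.
import numpy as np

ratings = np.random.uniform(1, 7, size=1500)  # placeholder MTurk ratings
bold = np.random.randn(1500)                  # placeholder z-scored responses

edges = np.linspace(1, 7, 8)                  # seven equal-width rating bins
bin_idx = np.digitize(ratings, edges[1:-1])   # bin index per sentence
bin_means = [bold[bin_idx == b].mean() for b in range(7)]
# An inverted U would show bin_means rising toward mid-range ratings
# and falling again at both extremes.
print(np.round(bin_means, 3))
```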
With that, I want to draw a few conclusions and perspectives, returning to my initial main-question slide. For the first sub-question, are LLMs actually accurate enough to non-invasively control brain responses implicated in higher-level cognition, such as the language network? We saw that yes, they are: we can use these models as tools to generate stimuli that drive or suppress brain responses. For the next question, what kind of linguistic input is the language network most responsive to, basically what kind of stimuli drives the system the most? We found that sentences with unusual grammatical structures and somewhat odd meanings really drive the system, but they can become too unusual. The system operates on a scale: if there is interesting information to extract, the network will work hard to extract it, but if the input becomes too ill-formed, the network no longer responds. We also found that sentences with negative content, or that are very abstract and hard to visualize, also elicit high responses. We put this paper out as a preprint very recently with these brilliant collaborators, and we investigated a couple of other features and a bunch of other analyses, so you can check that out if you're interested.

I just want to finish with this perspective slide. Within audition, we have deep neural network models that can accurately predict brain responses in the auditory cortex, as a couple of studies have shown over the last few years. In recent work in collaboration with Jenelle Feather, Dana Boebinger, and Josh McDermott, we quantified how well different deep neural network models, here on the x-axis, predict the responses of auditory cortex voxels, and we found that many of these models are very good at predicting responses in the auditory cortex. That leaves interesting avenues of work: using these deep neural network models for audio to potentially provide high-fidelity control of subdivisions within the auditory cortex. And with that, thanks a lot for listening, and feel free to reach out if you have any questions or comments. Thanks.

Hi Greta, great talk. Thank you, it's very interesting. I've got a couple of questions in the chat. John Rowan was asking: what was the task for the participants when they listened to the sentences? And when they rate the sentences, how do you instruct them? What are they required to do during the task?

Right, thanks for the question. Regarding the ratings and obtaining these norms for the sentences, participants were instructed as follows. First we provide an example: "Now we want you to provide a rating from one through seven of how easy a sentence is to visualize," along with an example of a very concrete sentence, like "The apple is on the table," and a more abstract sentence, like "I am lonely." Then we ask the MTurk raters to read the sentences and provide these ratings. For the fMRI task, participants were asked to read each sentence, think about its meaning, and pay attention. There was no explicit task in the fMRI paradigm, because we wanted to keep it as close as possible to a more naturalistic paradigm. Thank you, that was very clear.
And another question, from Sebaad: did you run your predictive model on actual clinical sentences to see whether they lean more toward drive, baseline, or suppress characteristics, and what the impact on the clinic would be if there's a tendency toward one of them? Because in the clinic, especially in audiology, we sometimes test speech-in-noise performance to evaluate how well a patient can understand speech, especially in noisy environments. Some of the commonly used materials include the Hearing in Noise Test and the AzBio sentences. Those speech materials were designed to be phonetically balanced, and they are said to have equal difficulty in terms of grammar. So I'm also very curious: if you ran your model on those materials, would it predict different brain responses to particular sentences versus others?

Yeah, thanks for that question. Just one note before I start answering: in this paradigm we actually had participants read, but the language network responds to both the auditory and the visual modality, so there shouldn't be a big difference; I just want to make that clear in case it wasn't in my talk. My answer to the question is no, I have not run this predictive model on clinical sentences. I had actually never thought about that before, but it's very doable, and I think it demonstrates the strength of these predictive models: they can take any sentence as input. They're not limited to any particular vocabulary; these LLMs are able to take any given word and provide a representation for it. So that's definitely doable, and you could see what the predictions would be.

Yeah, I look forward to hearing about it once you run your model on those sentences, because your research apparently is very effective. It's very interesting. And one more question, a very creative one: could you use your model to rate someone's presentation, like how easy it is to listen to an instructor? You mentioned in your presentation that if something unusual pops up in a sentence, listeners tend to show higher, more active brain responses, right? So if you ran the model on someone's lecture and it predicted very active brain responses, would that mean the instructor is doing a good job of conveying the information?

Yeah, that's a fun question. Say we have a lecture and we run it through the predictive model; we would get a proxy for when the language network is working hard, when the responses are high. And when the responses are low, you could probably get a good idea of when people might be checking their phones, when the content is not as stimulating or exciting, because we do see that with sentences that are surprising, where there seems to be a lot of information to extract, the responses go higher. So if I were to structure a lecture using that method, I would probably aim for a somewhat uniform density of surprise spread across the lecture, so you wouldn't completely lose the students. But yeah, that's a fun one. I guess there could be some educational applications. Yeah, thank you for the answer. And okay, one more question.
Do you expect a sentence to affect the next sentence, like coherence in a story?

Yep, that's a very, very important one. In this particular paradigm, we treat each sentence as its own unit, which is of course a big assumption, because in almost all language processing we rely heavily on prior context. In this paradigm we try to isolate the sentence-level response: there's a short break between sentences, and participants are aware that it's not a coherent story. So we try to minimize the effect of context, and we also randomize sentence order across participants, which of course helps, but it is a big assumption, and in more ecologically valid settings we do rely a lot on context. My prediction is that if we feed prior context into the model, it should be able to account for whether something is surprising or not. There's a classical example: if you hear a sentence like "The peanut fell in love," that's super surprising, a weird sentence. But if you first heard a story about the peanut, with this and that prior context, then that sentence is no longer surprising. If the model we're talking about, GPT, can provide a good enough representation of the context, then the predictions for the brain response should go up and down accordingly, as in the sketch below. But that depends, as I just said, on whether GPT can represent the context in a meaningful way, which I have not tested.

Yeah, thank you. Based on the previous question, and okay, I promise this is the last question: have you tried sentences from a foreign language, or random text, in addition to the baseline? I think this question is related to the context we talked about earlier, because a foreign language that the participant may not know could be something surprising, but on the other hand it might not elicit linguistic processing in the brain at all. In that case, would you predict that your model would give a strongly driven response?

I have not tested foreign-language input, but I believe the current model would predict it to elicit high responses, just because it's odd and different in that sense, and that's a fault of the model. When we fitted this model, we mostly used pretty naturalistic linguistic input; we did not include very odd strings, foreign language, or word lists. Had we done that, I think we would have been able to get the model to predict that random word lists or foreign text elicit lower responses, which I do predict would be empirically the case, but I don't think the current state of the model actually gets at that. And that's a fault of the predictive model. Yeah, I feel the same. Well, thank you for answering all the questions. Yeah, thanks for the great questions. It's very interesting; we've enjoyed it.
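Returning to the context question above, here is a minimal sketch of how prior context could be fed through the model before predicting the response to a target sentence, using the peanut example. It reuses `tok`, `lm`, `sentence_embedding`, and `encoder` from the earlier sketches; pooling over only the target's tokens is an assumption on my part, and, as the speaker notes, this idea has not been tested.

```python
# Sketch: embed a target sentence with and without prior context, pooling
# only the target tokens, and compare the encoder's predictions.
import torch

def contextual_embedding(context, target, layer=6):
    n_ctx = tok(context, return_tensors="pt")["input_ids"].shape[1]
    ids = tok(context + " " + target, return_tensors="pt")
    with torch.no_grad():
        hidden = lm(**ids).hidden_states[layer]
    # Keep only the target's tokens (BPE boundaries make this approximate).
    return hidden[:, n_ctx:, :].mean(dim=1).squeeze(0).numpy()

bare = encoder.predict(sentence_embedding("The peanut fell in love.")[None, :])
primed = encoder.predict(contextual_embedding(
    "Once there was a peanut who could walk, talk, and feel.",
    "The peanut fell in love.")[None, :])
# If GPT-2 represents the context well, `primed` should sit below `bare`.
print(bare, primed)
```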